[SCM] libm4ri: library of Method of the Four Russians Inversion annotated tag, debian/20130416-3, created. debian/20130416-3

Mon Jun 17 20:05:00 UTC 2013

The annotated tag, debian/20130416-3 has been created
        at  bfcb10289709309a68323afd514ead2cbea4d058 (tag)
   tagging  8388256663b8819c5021ea583fd929bb72fbb0ea (commit)
 tagged by  Cédric Boutillier
        on  Mon Jun 17 21:44:36 2013 +0200

- Shortlog ------------------------------------------------------------
libm4ri Debian release 20130416-3
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)

iQIcBAABCAAGBQJRv2ckAAoJENpJWPYR4UnpSesP/0YCI27xbYSRR2CgnAJ5RphU
iR6gYp3m3VvCEc9N5TaQozUS4xA8Btljuk5u1fJMqvJvOzJ/xFkC6VgiJYhFHXzg
/8PKD/GzeCJepEIflEJ5UROo0aPNiQDR2qqZvnJa3ecPof1+fEjXmrKq+ZsHOx5H
FZqHIIGd24603V6VqQEl4a5MBaQAoGevm4bVV6OoYWJsVtVFWqKzkTOOh4ISDl6d
b7DEoQpeSRFcyuqHZfzmTbnI/u4kwhL//kJpJsyJwxVZEiCVy6efynZkLbirrbaG
/LohlFGShmwJjl2haw9L+WDUlxglkCAw0fS4QR6CGdy76wh+t1ATH3XQGAJpMgT9
NYYtlM8wOU0eIx8Onk8ps++qs/cIJEW3OzX19vF7IrpSyMhrwYv5pLoG3R3ncGbr
OHsCz9cxTNhalZuLRndz5XHfIS2On+KXpXoBuvWpFluM55Jn1fFJB0Z+4Q9FZIQ0
7yu+7bfVSHnsenXC1sFsJ49Pf8JI7CSNiJ92LVBfuP7OmdRwWHP6E26MOP8JSqhT
/GM6i1SQa4auMJq37Zbv9nuMkU3kLryI87Mrrc6L+9rT2UfHiUPVMN93bW8VXnZU
07/JMbVrRxD8rVpUpvDAGWaM6iD2LGCMr0RS+DJCxTGM7OS4sMWcIiYZdSShLQ83
JugKkEOsNwzrM+QiqDjX
=QfRk
-----END PGP SIGNATURE-----

Alexander Dreyer (1):
      FIX: m4ri called exit in library code

Carlo Wood (105):
      Added cwautomacro's 'autogen.sh' to generate auto tool files.
      Use the canonical 64-bit type for word.
      Get rid of explicit unsigned long long constants.
      Fixed a typo.
      Bit hack speed ups and documentation fixes for misc.h.
      Fix inclusion of misc.h.
      More bit hack improvements.
      More micro optimizations and a bug fix.
      Final micro optimizations in packedmatrix.h.
      Compiler warning fixes.
      Add back the +1 to the result of log2_floor.
      Make sure the correct library is being used at run time for the testsuite.
      Introduction of MIDDLE_BITMASK.
      Hijack testsuite/Makefile to compile matops.
      Rewrite of _mzd_transpose_direct
      Code alignment that makes _mzd_mul_naive-64 20% faster (or not).
      Remove dependency on cwautomacros.
      Introduction of mzd_read_bits_int.
      Implement WRAPWORD
      Make things work for both, g++ and gcc.
      Work in progress
      Explicitly convert a word to BIT.
      Explicitly use CONVERT_TO_WORD every time an int (or BIT) is converted into word.
      Explicitly use CONVERT_TO_UINT64_T when a word is transformed to integer.
      testsuite changes
      Use uint64_t, not word, when we are dealing with 64-bit integers.
      Bug fix for _mzd_transpose_direct
      Use sizeof(word) where appropriate.
      Out of range bug fix.
      Add braces around expressions with & used as truth value.
      Introduction of M4RI_WORDWRAP and the C++ class word.
      Fake merge of dead head
      Reversed the bit order of the internal representation of class word.
      Move reversal from word(uint64_t) to CONVERT_TO_WORD.
      Export reversal from CONVERT_TO_WORD to code.
      Cancel reversals in word::operator-(void) const and WRITE_BIT.
      Change shift operators back to their original state.
      Remove word::operator-(int).
      Added extra asserts to make sure that shifts are within defined range.
      Bring word::convert_to_BIT back to its original state.
      Bring word::convert_to_int back to its original state.
      Remove last traces of reverse.
      Reverse ONE.
      Minor cleanup of misc.h.
      Use int consistently (needed for wordwrapper)
      Remove the FIXME from _mzd_transpose_direct_128
      Removed FIXME from _mzd_addmul_weird_weird
      Type and whitespace clean up.
      Type and whitespace clean up (part 2).
      Merge with https://bitbucket.org/malb/m4ri changeset 7d7a103dfba3
      Benchmark facelift.
      Random benchmark improvements.
      Use TOPSRCDIR Makefile var instead of PWD. Inverse random bits when needed.
      Bug fix in m4ri_random_word.
      Duplicated code of m4ri_randomize m4ri_random_word to benchmarketing.c
      Take BENCH_RANDOM_REVERSE into account in bench_randomize.
      Fixed copyright header in testsuite/test_random.c.
      Added general benchmark program for individual packedmatrix functions.
      Merge with malb
      Improved printed Usage output.
      Bug fix in print_complexity1_human and complexity code updates.
      Minor changes to mzd_first_zero_row
      Packedmatrix benchmark fixes.
      Fix constness of packedmatrix mzd_t input pointers.
      Create a randomize matrix for each call for mzd_gauss_delayed and mzd_echelonize_naive.
      Add LIKELY/UNLIKELY macros for future use
      Fix order of calloc function parameters.
      Added support for PAPI.
      Determine and use LIBPAPI_PATH.
      Also search for papi.h by using -I include flags.
      Prefix all exported variables, functions and macros.
      Move _mmc_ code from misc.h to packedmatrix.c.
      Bug fix for crash of bench_* programs.
      Bug fix, forgot a few instances of CPU_L2_CACHE.
      __M4RI_ENABLE_MMC juggling and support for posix_memalign
      Moved mmc functions to their own file.
      Doxygen warning fixes.
      Merge with malb
      Added forgotten m4/ax_func_posix_memalign.m4
      Compiler warning fixes.
      Allow to only dump a single counter.
      Add dependency on m4ri headers to testsuite.
      Make it harder for the compiler to put parts of inlined functions outside our loop.
      Do not install or include config.h in header files.
      Fix constness of trsm* functions.
      More constness fixes.
      A few more compiler warning fixes and a const thingy.
      More constness and some whitespace issues.
      Add --enable-debug-dump.
      Documentation fix.
      Add new elements to mzd_t and keep them consistent.
      Add option --debug-mzd.
      Move __M4RI_CPU_L1_CACHE and __M4RI_CPU_L2_CACHE to m4ri_config.h.in.
      Added mzd_t::offset_vector and made mzd_t::blocks non-zero also for windowed matrices.
      Added row_offset and accessor functions for mzd_t using it.
      Implement separate cache for mzd_t.
      Compiler warning fixes.
      Rewrite of _mzd_addmul_even_weird to use rowstride.
      Major improvement of transposing.
      Bug fix and general fixups. Testsuite for transpose.
      Bug fix in mzd_equal.
      Speed up of mzd_col_swap with a factor of two.
      Also ignore generated maintainer file ltmain.sh
      Add support for transposing multi-block matrices.
      Copied the improved code of mzd_col_swap to mzd_col_swap_in_rows and added support for start_row/stop_row.

Clement Pernet (10):
      work in progress in lqup
      * add permutation window
      fixing trsm calls to addmul
      * new matrix_addmul with any weird dimensions (still need to be tested)
      Martin patch:"more experimental permutation code, needs testing"
      some more stuff on the weird addmul
      Work in progress on the LQUP front: fixed a bunch of bugs, and get LQUP working on full rank matrices.
      fix LQUP doctest
      Added the 2 remaining trsm and the corresponding testsuite and benchmarks.
      Switch PLUQ -> LQUP

Cédric Boutillier (43):
      set distribution to UNRELEASED and add a -1 Debian version
      Convert to 3.0 (quilt) Debian source format
      Add build-dependency on dh-autoreconf
      change upstream tag format for git-buildpackage
      Do not ship .la file in libm4ri-dev
      Set debhelper compatibility level to 9
      Build-depend on dpkg-dev >= 1.16.1, add hardening
      Add a debian/watch file
      Add unapply-patches in debian/source/local-options
      Update upstream-versioning-change patch to fix SONAME
      Add strict dependency on the binary lib for the -dev package
      Pre-depend on multiarch-support, add Multi-Arch: same
      Fix *.install to use multiarch paths
      update changelog
      convert copyright file to copyright-format/1.0
      debian/watch: use uversionmangle instead of dversionmangle
      Merge commit 'release-20120613'
      prepare for new upstream version
      disable upstream-version-change patch
      update copyright info
      use upstream version numbering
      remove changelog entries for versions which never made it to the archive
      point to /usr/share/common-licenses/GPL-x for GPL-x+ license text
      Add myself to Uploaders
      disable sse2 flag
      override lintian message about the absence of upstream changelog
      target experimental
      Add VCS-* fields; Bump Standards-Version: to 3.9.4 (no changes needed)
      use canonical value for Vcs-Git: field
      Build-depend on pkg-config, libpng-dev (Closes: #699071)
      add OpenMP support
      add debug package
      reformat debian/control with cme fix dpkg-control
      update changelog
      Merge release-20130416
      prepare for 20130416-1
      add patches to enable sse2 for non Intel cpu and disable sse3
      remove upstream-versioning-change patch (not needed anymore)
      do not disable sse2 for x86_64 CPUs
      add debian/upstream file
      update changelog
      use DEB_HOST_ARCH_CPU instead of DEB_BUILD_ARCH_CPU
      upload to unstable

Felix Salfelder (13):
      libm4ri_0.0.20080521.orig.tar.gz
      imported debian from libm4ri_0.0.20080521-2.diff.gz
      Merge commit 'release-20111004'
      debian/0.0.20111004
      Merge commit 'release-20111203'
      debian/0.0.20111203
      update autogenerated files (do we need them?)
      debian/0.0.20111203-1.
      remove autogenerated files
      switch to dh
      Merge commit 'release-20120415'
      new upstream release, 20120415
      removed ltmain.dh (autogenerated)

Jean-Guillaume Dumas (1):
      * added is_zero

Martin Albrecht (480):
      initial commit
      - refactoring (renaming of functions, files)
      Strassen seems to work if the matrix dimensions are exactly right
      added support for SSE2 instructions (for now these need to be enabled by hand). The speed-up is
      Strassen multiplication seems to work now
      - added support for SSE2 if available (autodetection)
      fix build on PPC
      continued refactoring (should be almost done) and fixed bug in naive multiplication
      refactoring should be done
      simplified combine, don't try to outsmart the compiler
      doxygen updates
      a potentially more cache-friendly implementation, needs checking
      misc cleanups
      fix version-info
      Doxygen coverage 100%
      implemented memory efficient strassen multiplication operation schedule
      removed dead test code, added strassen.h to m4ri.h
      moved mzd_combine to packedmatrix.[c|h]
      SAFECHAR =  (1.3 * RADIX) is sufficient
      slightly improved clearing of target matrix in _mzd_mul_m4rm_impl
      marking more parameters const
      some cosmetic changes to packedmatrix.c
      declaring more parameters const
      docstring updates and API unification
      fixed compilation under OSX (32-bit) and under OpenSolaris (32-bit)
      remove unnecessary local variables, add explicit casts as picked up by MSVC
      added support for Visual Studio 2008 Express
      using XOR directly rather than calling mzd_combine gives a significant speed-up so we do that for now. Need to check if this is related to SSE2 and if we can re-introduce it
      adapt documentation: We use Strassen-Winograd not Strassen
      more documentation for the Opteron vs. Core2Duo performance compromise
      reintroducing SSE2 to m4rm multiplication
      unify SSE2_CUTOFF
      fix SIGSEGV
      some minor documentation updates
      compile fix for HAVE_SSE2 == False
      don't use free on _mm_malloc'd memory
      fixing benchmarking/testing code and adding it to revision control
      only call _mm_malloc if it is really available
      faster naive multiplication but still not as fast as it could be.
      added William Hart's Block M4RM implementation which gives a significant speed-up!
      document M4RM_BLOCKSIZE
      make run_bench return min,median,average and max
      nicer parameter names for mzd_combine
      re-added SSE2 support to mul_m4rm which gives a quite tiny speed-up
      faster transpose
      block'ing naive matrix multiplication and using that by default if B->ncols < some threshold
      reverting benchmarking code to square matrices
      copy window to matrix to improve data locality in strassen multiplication
      fix commenting style
      removed parameters T and L for M4RM (they weren't used anyway)
      new implementation of M4RM multiplication with two Gray code tables. The idea is by Bill Hart
      implemented first parallel strassen-winograd multiplication (compile with -fopenmp -DHAVE_OPENMP)
      some (style) improvements for SSE2 code by Bill Hart
      fixes for the last check-in (all rows are aligned now if no windows are used)
      use 8 instead of 2 Graycode tables (implementation and idea by Bill Hart)
      allow control over number of Gray code tables via define GRAY8
      added support for SSE2 to new _mzd_mul_m4rm_impl this improves performance on C2D considerably,
      fix bug in reduction introduced by speeding up make_table
      added new testcase, cleanup for valgrind
      added more test (corner) cases
      fixed bug Bill Hart reported, fix all things Valgrind reported and made code run faster on C2D.
      added Bill's cutoff improvement
      make OpenMP support configurable
      updated MSVC project, added all relevant headers to m4ri.h
      fix include order
      slightly more clever loop unrolling using a Duff device, doesn't make much of a difference
      remove unused variable
      slight simplification for process rows and HAVE_SSE2
      new M4RI1 routine for matrix reduction, which is still buggy for singular matrices
      more small work m4ri1, this is buggy, experimental, play-around code
      Michael Brickenstein:
      M4RI doesn't fall back to Gaussian elimination so easily anymore. In fact, it never does. This
      remove mzd_process_row and changed interface for mzd_process_rows to treat stoprow as exclusive (this is more C-ish)
      speed improvement for M4RI
      more speed improvements for M4RI
      implemented using two Gray code tables at the same time, which improves performance.
      some slight improvement to mzd_row_add_offset
      removing number of parallel processed rows to two.
      implement lazy strategy, i.e. attempt to not reduce rows already reduced.
      removed old commented-out reduce implementation
      removed references to old implementations
      renamed GRAY8 macro to M4RM_GRAY8 since it only applies to multiplication
      avoid potential memleak in shared library mode where the Gray codes are rebuilt several times.
      another attempt at speed improvements
      4 Graycode tables seem to be good, need to test on Opteron. For large matrices we hit L2 so
      don't reduce a row if it is already reduced, slight overhead for random matrices, huge gain
      big check-in (sorry):
      added documentation for lacking bounds checks
      fix Gaussian reduction for full=FALSE, reported by Wael Said
      slightly improved the k parameter for reduction, the M4RM k parameter can be adapted for the Core2
      adapted parameter k for top_reduce too
      work in progress: mzd_addmul_strassen
      fix printing for ncols%RADIX == 0
      fix typo in documentation
      implemented memory efficient addmul
      added mzd_col_swap
      fixed dimensions of X0,X1,X2 in addmul_strassen
      added a bunch of functions and CHANGED THE API!
      macros more robust by adding lots of brackets
      first version of col_rotate
      2nd attempt at col_rotate, doesn't update permutation yet
      M4/autoconf trickery
      sane default value for Strassen cutoff
      API CHANGE, dropping all _impl's. also improved MP Strassen slightly
      merging Clement's patch, everything should work
      initial untested code for permutations
      checking in all files that automake doesn't autogenerate
      commenting stuff out that prevents the build
      patch bomb:
      int/long -> size_t cleanup courtesy of MSVC
      added cached memory management option, which is disabled since it doesn't seem to make a difference
      removed -fopenmp
      if create/destroy_all_codes is called twice ignore the second call.
      renamed combineX_sse2 to combineX
      improved and enabled memory manager, also introduced shared library constructors and destructors. These seem to work with GCC, needs
      thread safe-ness + refined lib constructor/destructor
      added extern "C" safeguard
      quick rename of one variable, trivial
      __SUNCC__ -> __SUNPRO_C__, untested
      added "proximity schedule" from FFLAS, but that doesn't seem to improve performance
      removed proximity schedule again
      adapted parameters for Opteron
      changed strategy for parallel multiplication to block-parallel-then-strassen
      updated README and AUTHORS
      preparation for next release (targeted: Sunday)
      renamed reduction to elimination
      new strategy for k in M4RI, seems to work well on Opteron and C2D
      define CPU_L2_CACHE in misc.h if it isn't there already
      - fix compilation with MSVC
      fix docs
      merge of Clement Pernet's patch:
      slight coding-style clean-up after merging Clement's patch
      documentation update
      new strategy for k for multiplication, should fit Opteron and C2D
      fix a SIGSEGV and sometimes wrong results for matrix multiplication
      new release
      Added tag release-20080826 for changeset 6b307aa254cb
      work on LQUP (or LUP right now)
      fix warnings issues by ICC & remove unused watch.c/.h
      removed watch.h from m4ri.h
      more work on LUP, still not correct
      fix memleak in addmul
      more scratch code for LQUP
      release 20080901
      Added tag release-20080901 for changeset bf3d55ccb73b
      fix cache size detection handling
      fix/unify bit shifting bugs as exposed on Itanium
      checking in  Arnaud Bergeron's cache detection fix for PPC + my adaptation
      ... and reverted my changes again since they don't work
      added RIGHT_BITMASK equivalent for LEFT_BITMASK and (hopefully) made the code more readable
      Added tag release-20080904 for changeset ce71e2c84ad1
      suppress redundant output
      some work on LQUP basecase, not working yet
      checking in fix by anakha
      playing around with LQUP
      anakha's fix again for configure's cache detection
      some more work on LQUP but not working yet
      LQUP basecase (slow but seemingly correct for square matrices)
      make C++ compiler happy by fixing m4ri_die's signature
      fix mzd_add speed regression
      fixed bug in trsm routines
      better benchmarketing code
      improved Makefile.am and added make check
      new release
      enabling LQUP doctests
      fix two MSVC warnings
      update MSVC project file
      fix doctest failure under OpenSolaris
      I'm just playing with MMPF LQUP (not to be taken seriously)
      update/correct license statement in source files. M4RI was always GPLv2+
      do not use rowswap array to swap rows, always copy:
      improved testsuite build process
      faster LQUP (use Strassen instead of M4RM only) and more comprehensive test suite
      PLUQ work in progress
      PLUQ MMPF work in progress
      fixed a bug introduced by fixing RIGHT_BITMASK
      some minor clean-up after fixing the TRSM tests
      remove some assert(M->offset==0)
      renamed LQUP -> PLUQ
      MMPF: deleting L for now for debugging purposes, once Q is correct, don't kill L
      added method for permutation printing
      Q seems to be correct now for MMPF
      bumped version in Makefile.am to aim for release for end of month
      added optimized function for v*A where v is a (1,d) vector and A is a (d,d) matrix. The code is
      removed a lot of old functions that were not needed anymore
      added (untested) mzd_copy_row function. The function is based on Michael Brickenstein's copy_row.
      and added mzd_row_clear_offset again
      PLUQ permutations are still wrong, MMPF might be alright
      cleaned up some cruft left over from debugging sessions
      mzd_submatrix accepts offsets now
      remove debug printing
      documented/cleaned up MMPF
      mzd_col_swap is a bit faster now, fixed memleak in bench_lqup
      better crossover and 'better' Q update
      doctest should cat LQUP failures for smaller examples
      -fixed spelling of naive across the board
      a supposedly working PLUQ implementation (doesn't work with MMPF yet)
      PLUQ factorisation with MMPF base case seems to be working!
      added m4ri_random_word (check randomness of output)
      use m4ri_random_word in mzd_randomize (todo: check randomness)
      better strategy for column swaps in mzd_pluq_mmpf (still way too slow for matrices with r << n)
      factored out PLUQ MMPF and wrote faster MMPF routine
      PLUQ is really really slow for e.g. half ranks. Some code to fix this but no luck so far
      faster M4RI for sparse matrices
      improved (faster) pivot search in MMPF
      clarified documentation
      allow half rank in bench_elimination
      factored out pivot finding to fast subroutine
      commented out SSE2 attempt for mzd_col_swap
      massive speed-up for sparse matrices
      use fast pivot searching code in mzd_reduce_m4ri
      slightly faster column swaps?
      implemented mzd_echelonize_pluq (mzd_echelonize_FOO is so much better than mzd_reduce_FOO)
      fixed MMPF
      mzd_row_add_offset not static inline anymore
      changed API and updated docs: mzd_reduce_ mzd_echelonize_
      mzd_print_matrix -> mzd_print; mzd_mul_m4rm_t removed
      merge mzd_row_add_offset move
      updated AUTHORS
      made some todos more visible
      fixed solve for full rank A
      fixed warning in test_solve
      yet another printf fix
      added COPYING file to repository because autotools insist on GPLv3+ while we're GPLv2+
      new strategy for dealing with not-full rank submatrices in MMPF
      renamed a fullrank -> halfrank in testroutine
      better handling of sparse matrices in MMPF
      more testcases (mzd_echelonize_m4ri() fails)
      fixed some minor bug in TRSM
      fixed PLUQ MMPF bug
      improved bench_elimination to allow choice of algorithms (mmpf, pluq, naive)
      spend less time in mzd_process_rows when in mzd_process_rows2_pluq
      added mzd_density()
      apply_p_right_trans() more cache friendly
      improved cache friendliness of column swaps in LQUP
      improvement for sparse matrices in M4RI
      fixed memleak in test_solve
      preparing for release 20090105
      fixed doxygen warning
      fixed MSVC compiler warnings
      updated MSVC project to include pluq_mmpf and solve
      Added tag release-20090105 for changeset 0b25b0a1474a
      small clean-up in mzd_copy()
      make bench targets depend on Makefile
      make m4ri_coin_flip static inline to remove noise from oprofile run
      inlining a couple of often called functions, this should help a bit
      renamed Macros' functions _evenb -> _even and the original functions _even -> even_orig to
      fixed a SIGSEGV in mzd_echelonize_pluq() when full==0
      remove unnecessary if() statements in mzd_pluq_mmpf()
      improved documentation (added docs on return values) and removed redundant parameters from mzd_echelonize_m4ri()
      some trivial doxygen fixes
      fix bench_elimination.c vs. new mzd_echelonize_m4ri() API.
      call _mzd_mul_va from mzd_mul_naive if appropriate
      refactored packedmatrix to allow more than one malloc call to allocate the matrix
      cleaning up the new code
      added Macro as an author
      fixed a few warnings and one error as reported by MSVC
      set max malloc size to 1GB
      fixed release soname
      implemented mzd_kernel_left_pluq to compute kernels via PLUQ
      added test code for mzd_kernel_left_pluq()
      fixed testcode for mzd_kernel_left_pluq()
      do not prepend zero in cache size detection since that will trigger octal interpretation of the result
      yet another fixing attempt for cache size detection
      fix L1 detection on OSX x86
      fix compilation with --enable-openmp
      experiments with OpenMP
      fixing OpenMP doctest failures
      added low level parallelisation to process_rows2_pluq and added that the parallel sections in mzd_mul_mp_even() should use num_threads(4)
      switch back to using threads if any additional thread is available, don't require at least four
      fix bug in mzd_is_zero() where small zero matrices wouldn't be reported as such
      implemented adding 3 and 4 rows in one step for PLUQ MMPF and adapted constants accordingly
      only swap at the end of the base case not during while finding the pivots. This allows a more
      made mzd_apply_p_right and mzd_apply_p_right_trans more efficient to decrease the penalty of column swaps.
      Added tag release-20090617 for changeset 46b89e01b348
      switching MMPF from PLUQ to LQUP and enabling it
      improved performance for LQUP factorisation to roughly match that of PLUQ, still work to be done
      fixed a bug which escaped me for the last check in because I didn't check with cutoff=64
      some performance improvements for sparse-ish matrices
      don't apply permutation if todo rows == 0
      added _pluq_mmpf back for debugging etc.
      copy submatrix to temporary when switching to MMPF
      use L2_CACHE_SIZE for PLUQ cutoff (experimental)
      improve performance of mzd_transpose using Hacker's Delight bit-fiddling trick (closes: #15)
      don't check for the number of CPUs on configure. The macro is not cross platforms and we don't use it anyway (fixes #16)
      whoops, forgot to check in configure.in
      implemented timing experiment to calculate L1 and L2 cache size. This isn't working perfectly yet and thus it is only optional for now.
      moving 'step 1.5' of LQUP MMPF to _mzd_lqup_submatrix because it caused confusion that the postprocessing is outside of that function.
      fixed potential segmentation fault in mzd_row_add_offset
      merge
      changing the soname version to 20091101 in preparation for new release
      fix bug which led to wrong results on t2.math.washington.edu
      another sizeof(size_t) != sizeof(word) bug
      fixing warnings/errors reported by Microsoft Visual Studio
      Added tag release-20091101 for changeset 66644740d92d
      fixed doxygen warnings
      defaulting to '0' instead of 'unknown' in ax_cache_size.m4. This should make things more cross-platform
      considerable portability improvement in configure.ac due to David Kirkby
      only perform column swaps on non-zero rows in mzd_echelonize_pluq. For some sparse matrices, this gives an advantage
      renamed mzd_apply_p_right_tri to mzd_apply_p_right_trans_tri because this is what it does
      fixed a bug in permutation which caused segfaults (cf. Sage #8301)
      be slightly more clever about selecting 'k' in _mzd_lqup_mmpf() by mirroring M4RI strategy
      revert temporary switch to _mzd_lqup_naive in _mzd_lqup (it was just a benchmarketing test)
      updated to current Debian version (this file should be removed from revision control, it doesn't belong here)
      current OpenMP complains about return from critical blocks, also removed nested critical blocks
      tuned OpenMP parameters for M4RI on sage.math
      implemented heuristic algorithm which starts with M4RI and switches to PLS based
      * renamed LQUP functions and filenames to PLS
      fixed docstring for PLS decomposition
      updating Visual Studio project
      allow the user to disable SSE2 instructions
      fix default parameters in configure
      Added tag release-20100701 for changeset 8513835b2a92
      exporting all mzd_process_rowsX variants
      refactoring to allow m4rie to reach into some of our fast routines
      wide should be a size_t
      Cygwin requires no-undefined, otherwise no shared lib is built
      make sure the memory managers match!
      more robust cache tuning by increasing the number of trials
      improved speed of cache tuning, seems to give good results on t2,bsd,road,prai243,redhawk,eno,iras
      new release
      Added tag release-20100817 for changeset 6758e6a445c0
      fixing solving (for systems which are consistent)
      yet another fix for system solving. Inconsistent systems *are* detected despite
      fixing segfault in corner case of solve.c
      implemented simple TRSM upper left using Greasing
      rewrote mzd_make_table in order to support offset!=0 needed for M4RI based TRSM (experimental)
      a more comprehensive test suite for TRSM
      adding optional randomized tests for TRSM
      new TRSM passing all tests now
      slight speed improvement for TRSM upper left
      package passes make distcheck now
      adapt testsuite to new build structure
      allow generic ranks in bench_elimination so we can improve rank sensitivity
      new function _mzd_compress_l which implements compression of L for PLS
      more work on compression of L
      optimised compression L in _mzd_pls() (fixes #23)
      benchmark(et)ing code for sparse-ish matrices
      _mzd_pls_submatrix() only considers the currently needed words instead of whole rows
      don't compute the full PLUQ in mzd_echelonize_pluq() if full=0
      slightly better _mzd_combine
      *** empty log message ***
      set the random seed to a fixed value to allow reproducible tests/benchmarks
      added benchmarketing "framework" for getting more reliable timings out of bench_ files.
      adding swap_bits() function to ease transition for third parties to new matrix layout
      removing workarounds for compiler bug (not properly aligned loops) since they are not cross platform
      merging with Carlo's random() fixes
      adding autogenerated files to .hgignore
      improved mzd_add
      merge with Carlo's copyright update
      I foolishly forgot to add some of the newly added files
      correcting a few minor things in bench_packedmatrix
      merging in Carlo's benchmarking patches
      bench_smallops made obsolete by bench_packedmatrix
      matops doesn't exist anymore
      add hack to add PAPI include directory.
      allow --with-papi=PREFIX when AC_CHECK_LIB cannot find it (i.e. the case it is meant for)
      remove ltmain.sh which is autogenerated
      install debug_dump.h otherwise programs linking against the library will fail to compile
      initialise variables (i.e., take care of Wall reported errors)
      follow-up check-in for cache size fix
      do not fail if realpath is not installed
      only set HAVE_PAPI if we have papi
      merging Carlos' swap patches
      adapting release version
      MS Visual Studio 10 support
      fixed typo which prevented compilation
      xor is a restricted keyword in C++
      updating README and AUTHORS for upcoming 20110601 release
      print cycles per bit in bench elimination and multiplication
      fixing printing of benchmarketing information
      fix compilation and segfaults when OpenMP is enabled
      zero out transpose target matrix before writing to it
      documentation update on PLE factorisation
      disable manual zero-ing out the transpose matrix since our tests indicate it happens on the fly anyway. Added tests though.
      adding m4ri_spread_bits and m4ri_shrink_bits + testcases
      fix memleak in vector_destruct()
      revise PLE decomposition to match new block-iterative algorithm.
      flush the buffer in tuning such that the user gets feedback that we are not hanging
      use less iterations per experiment in cache tuning but more experiments
      added option to pass cache sizes explicitly to configure
      adapting soname for upcoming release
      Added tag release-20110601 for changeset 75bcfb497a80
      Added tag release-20110613 for changeset 68c0b623b59a
      Added tag release-20110715 for changeset ab55c3167691
      use new-style config.h
      handle cflags better
      changing version to 20110901 for upcoming release
      Added tag release-20110901 for changeset 753358af056e
      mzd_cmp() should not compare stuff after ncols
      whitespace stuff
      bugfix in mzd_is_zero()
      renamed SIMD_FLAGS to SIMD_CFLAGS && defined __M4RI_SIMD_CFLAGS and __M4RI_OPENMP_CFLAGS in m4ri_config.h
      ... and fixed a bug in the last check-in
      changing version number for upcoming release
      Added tag release-20111004 for changeset 7453821cbd9b
      PNG reading/writing & reading of JCF's sparse matrix format
      bugfix reading/writing png
      invert mono doesn't work for colormap so we do it by hand
      removed misleading "this file is broken" comment, it *should* not be broken.
      shipping pkg.m4 for systems which don't have pkg-config such as OpenSolaris
      use MATHJAX when generating Doxygen HTML docs
      fixed png stuff for machines without libpng (i.e., not building png stuff)
      changed library release to 20111203
      Added tag release-20111203 for changeset 8c2115cc469c
      allow -n 1 in benchmarketing code
      removed trailing whitespaces
      mzd_invert_m4ri replaced by mzd_inv_m4ri() with sane interface, implemented mzd_invert_upper_m4ri for inverting upper triangular matrices
      benchmarking & testing code for inversion
      bugfixes & asymptotically fast inversion for upper triangular matrices
      benchmarking for asymptotically fast trtri for upper triangular matrices
      improved benchmarketing code for PLUQ/PLE decomposition (still called PLS in code)
      faster & nicer trtri for upper triangular matrices
      refactoring: renamed files & functions (probably a bit more clean up to do)
      fixed cutting of matrices to fix problems with SSE2 instructions
      fixed spelling of Kronrod
      Added bench_trsm which unifies bench_trsm_*
      removing bench_trsm_* which are replaced by bench_trsm
      mzd_extract_u & mzd_extract_l and faster TRSM upper right
      corrected guardian ifdef
      fixed reference to bench_packedmatrix
      define __STDC_LIMIT_MACROS before including <stdint.h> (reported by Jerry James)
      fixes #38 (fix suggested anonymously)
      allow for really really big matrices (fixes #39)
      do not use HAVE_OPENMP, always use __M4RI_HAVE_OPENMP (fixes #41)
      limit size of mzd_t cache to 16*64 (fixes #40)
      mzd_transpose() on 0x0 matrix should not crash (fixes #31)
      open files in binary mode (for Windows) (fixes: #43)
      move OMP stuff to lower level (which gives better results on my 4 core i7)
      a few small changes to make scan-build shut up
      mzd_inv_m4ri makes no guarantees what happens when the input is not invertible
      fixed nasty bug in mzd_t_malloc/free
      Added tag release-20120415 for changeset cb1d737cb43e
      fixed compiler warning: "expected long long but got word"
      remove _startblock's from mzd_combine_weird (it ignored them anyway). fixes #47
      be a bit more careful about what to add to LIBADD for libpng
      next try to fix png detection
      library version to 20120615
      drop -lm
      preparing for upcoming release
      changing release management to make fuck-ups less likely
      some whitespace changes so stuff is aligned
      fixed bug in mzd_col_swap_in_rows which would break if more than one block was used per matrix
      Added tag release-20120613 for changeset d68372939136
      removed old unused code and trailing whitespaces in strassen.c
      remove autogenerated files
      ignore autogenerated files
      prettier printing for bench_multiplication
      added PAPI support to bench_multiplication (still a bit dirty, might need cleaning up)
      use slightly less instructions in M4RI by avoiding mzd_read_bits calls
      slight improvement to PLE base-case
      use SSE2 more in xor.h
      use less instructions to read bits in ple_russian
      added support for PAPI in bench_elimination
      more data locality in multiplication via copying
      detect L3 cache and use it instead of L2 cache
      simplified + slightly faster M4RM code
      use more tables in ple_russian
      fixing bugs introduced in last two patches
      faster benchmarking code for bench_elimination
      some more fine-tuning of the parameter k
      enforce kk <= m4ri_radix
      clarifies documentation for JCF format reader and fixes bug in indexing (see #49)
      fixed autotools scripts for make dist to work (better)
      more fiddling around with cache tuning
      explicitly prefix m4ri to all includes, i.e., #include <m4ri/foo.h>
      deprecating cache tuning
      fall back to L2 size if L3 size is not detected/not present
      limit k to 6, which seems to give best performance for now
      more flexible bench_multiplication parameter parsing
      changing version for next release: 20121224
      we don't actually need 2.64
      choose better values for k if matrices are small
      process rows is pretty internal
      slapping a bunch of pragma omp parallel for's on the code, no guarantees whatsoever
      for small matrices _mzd_ple_to_e was a bottleneck, now it isn't
      adding mzd_process_rowsX_ple back as it is used in triangular_russian.c
      aligning at 64-byte boundaries as per advice of Richard Parker
      adding mzp_copy() for convenience
      preparing for upcoming release

Mate Soos (1):
      "long long" is not ANSI-compliant. Changing it to 'word'

Michael Brickenstein (1):
      don't bail out in mm_malloc if asked for nothing

Minh Van Nguyen (1):
      more user friendly documentation in README

Peter Jeremey (1):
      this patch solves:

Tobias Hansen (2):
      Fix gbp.conf
      Fix gbp.conf

bodrato at localhost (1):
      New Strassen-like sequence for multiplication, and squaring.

bodrato at mail.dm.unipi.it (1):
      Added new functions for addmul and addsqr using new sequences.

-----------------------------------------------------------------------

-- 
libm4ri: library of Method of the Four Russians Inversion