[python-hdf5storage] 72/84: Added documentation page for compression. (cherry picked from commit e8b33a0bf1b4c0897fc7ae7491143ed2000d4325)

Ghislain Vaillant ghisvail-guest at moszumanska.debian.org
Mon Feb 29 08:25:05 UTC 2016


This is an automated email from the git hooks/post-receive script.

ghisvail-guest pushed a commit to annotated tag 0.1.10
in repository python-hdf5storage.

commit 55992e285703e81f3226895deb8575c92b074b83
Author: Freja Nordsiek <fnordsie at gmail.com>
Date:   Tue Sep 1 01:44:22 2015 -0400

    Added documentation page for compression.
    (cherry picked from commit e8b33a0bf1b4c0897fc7ae7491143ed2000d4325)
---
 doc/source/compression.rst | 180 +++++++++++++++++++++++++++++++++++++++++++++
 doc/source/index.rst       |   1 +
 2 files changed, 181 insertions(+)

diff --git a/doc/source/compression.rst b/doc/source/compression.rst
new file mode 100644
index 0000000..13b9c04
--- /dev/null
+++ b/doc/source/compression.rst
@@ -0,0 +1,180 @@
+.. currentmodule:: hdf5storage
+
+===========
+Compression
+===========
+
+.. versionadded:: 0.2
+
+   HDF5 compression features were added, along with several options in
+   :py:class:`Options` to control them.
+
+
+.. versionadded:: 0.1.7
+
+   :py:class:`Options` will accept the compression options, but
+   ignores them.
+
+
+.. warning::
+
+   Passing the compression options to versions earlier than ``0.1.7``
+   will result in an error.
+
+
+The HDF5 libraries and the :py:mod:`h5py` module support transparent
+compression of data in HDF5 files.
+
+The use of compression can sometimes drastically reduce file size,
+often makes it faster to read the data from the file, and sometimes
+makes it faster to write the data. However, not all data compresses
+well, and data can occasionally end up larger after compression than
+it was uncompressed. Compression also costs CPU time, both when
+compressing the data and when decompressing it. The reason it can
+nonetheless lead to faster read and write times is that disks are
+very slow, so the space savings can save enough disk access time to
+make up for the extra CPU time.
+
+
+Enabling Compression
+====================
+
+Compression is enabled by default. It is controlled by setting
+:py:attr:`Options.compress` or by passing ``compress=X`` to
+:py:func:`write` and :py:func:`savemat`, where ``X`` is ``True`` or
+``False``.
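+
+A minimal sketch of both ways of setting the option (the file names,
+paths, and data here are arbitrary)::
+
+   import numpy as np
+   import hdf5storage
+
+   data = {'a': np.random.rand(1000, 1000)}
+
+   # Pass the option directly as a keyword argument.
+   hdf5storage.savemat('data_compressed.mat', data, compress=True)
+
+   # Or set it on an Options object and pass that instead.
+   options = hdf5storage.Options(compress=False)
+   hdf5storage.write(data['a'], path='/a',
+                     filename='data_uncompressed.h5', options=options)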
+
+
+.. note::
+   
+   Not all Python objects written to the HDF5 file will be compressed,
+   and some do not even support compression. In particular,
+   :py:mod:`numpy` scalars, and any type that is stored as one, do not
+   support compression due to limitations of the HDF5 library; though,
+   compressing them would be a waste anyway (hence the lack of
+   support).
+
+
+Setting The Minimum Data Size for Compression
+=============================================
+
+Compressing small pieces of data often wastes space (the compressed
+size is larger than the uncompressed size) as well as CPU time. Due to
+this, Python objects must be larger than a particular size before this
+package will compress them. The threshold, in bytes, is controlled by
+setting :py:attr:`Options.compress_size_threshold` or passing
+``compress_size_threshold=X`` to :py:func:`write` and
+:py:func:`savemat`, where ``X`` is a non-negative integer. The default
+value is 16 KB.
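+
+For example, a minimal sketch that only compresses objects of one
+megabyte or more (the threshold value here is arbitrary)::
+
+   import numpy as np
+   import hdf5storage
+
+   data = np.random.rand(512, 512)
+
+   # Objects smaller than 2**20 bytes are stored uncompressed.
+   hdf5storage.write(data, path='/data', filename='data.h5',
+                     compress=True, compress_size_threshold=2**20)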
+
+
+Controlling The Compression Algorithm And Level
+===============================================
+
+Many compression algorithms can be used with HDF5 files, though only
+three are common. The Deflate algorithm (sometimes known as the GZIP
+algorithm), the LZF algorithm, and the SZIP algorithm are the
+algorithms that the HDF5 library is explicitly set up to support. The
+library also has a mechanism for adding other algorithms; popular ones
+include BZIP2 and BLOSC.
+
+The compression algorithm used is controlled by setting
+:py:attr:`Options.compression_algorithm` or passing
+``compression_algorithm=X`` to :py:func:`write` and :py:func:`savemat`.
+``X`` is the ``str`` name of the algorithm. The default is ``'gzip'``
+corresponding to the Deflate/GZIP algorithm.
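+
+For example, a minimal sketch selecting the LZF algorithm instead of
+the default (the file name and data here are arbitrary)::
+
+   import numpy as np
+   import hdf5storage
+
+   data = np.random.rand(1000, 1000)
+
+   hdf5storage.write(data, path='/data', filename='data.h5',
+                     compression_algorithm='lzf')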
+
+.. note::
+   
+   As of version ``0.2``, only the Deflate (``X = 'gzip'``), LZF
+   (``X = 'lzf'``), and SZIP (``X = 'szip'``) algorithms are supported.
+
+
+.. note::
+
+   If doing MATLAB compatibility (:py:attr:`Options.matlab_compatible`
+   is ``True``), only the Deflate algorithm is supported.
+
+
+The algorithms, in more detail:
+
+GZIP / Deflate (``'gzip'``)
+   The common Deflate algorithm seen in the Unix and Linux ``gzip``
+   utility and the most common compression algorithm used in ZIP files.
+   It is the most compatible algorithm. It achieves good compression and
+   is reasonably fast. It has no patent or license restrictions.
+
+LZF (``'lzf'``)
+   A very fast algorithm but with inferior compression to GZIP/Deflate.
+   It is less commonly used than GZIP/Deflate, but similarly has no
+   patent or license restrictions.
+
+SZIP (``'szip'``)
+   This compression algorithm isn't always available and has patent
+   and license restrictions. See
+   `SZIP License <https://www.hdfgroup.org/doc_resource/SZIP/Commercial_szip.html>`_.
+
+
+If GZIP/Deflate compression is being used, the compression level can be
+adjusted by setting :py:attr:`Options.gzip_compression_level` or passing
+``gzip_compression_level=X`` to :py:func:`write` and :py:func:`savemat`
+where ``X`` is an integer between ``0`` and ``9`` inclusive. ``0``
+gives the least compression but is the fastest, while ``9`` gives the
+best compression but is the slowest. The default is ``7``.
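+
+For example, a minimal sketch trading compression ratio for speed by
+using a low compression level::
+
+   import numpy as np
+   import hdf5storage
+
+   data = np.random.rand(1000, 1000)
+
+   # Level 1 compresses faster than the default of 7, but not as well.
+   hdf5storage.write(data, path='/data', filename='data.h5',
+                     compression_algorithm='gzip',
+                     gzip_compression_level=1)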
+
+For all compression algorithms, there is an additional filter, the
+shuffle filter, which can help achieve better compression at a
+relatively low cost in CPU time. It is controlled by setting
+:py:attr:`Options.shuffle_filter` or passing ``shuffle_filter=X`` to
+:py:func:`write` and :py:func:`savemat`, where ``X`` is ``True`` or
+``False``. The default is ``True``.
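+
+For example, a minimal sketch disabling the shuffle filter while
+keeping compression enabled::
+
+   import numpy as np
+   import hdf5storage
+
+   data = np.random.rand(1000, 1000)
+
+   hdf5storage.write(data, path='/data', filename='data.h5',
+                     compress=True, shuffle_filter=False)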
+
+
+Using Checksums
+===============
+
+Fletcher32 checksums can be calculated and stored for most types of
+data stored in an HDF5 file. They are then checked when the data is
+read; if the file has been corrupted, reading the data will raise an
+error informing the user of the corruption. The filter can be enabled
+or disabled separately for data that is compressed and data that is
+not compressed (e.g. because compression is disabled or the Python
+object's data size is smaller than the compression threshold).
+
+For compressed data, it is controlled by setting
+:py:attr:`Options.compressed_fletcher32_filter` or passing
+``compressed_fletcher32_filter=X`` to :py:func:`write` and
+:py:func:`savemat` where ``X`` is ``True`` or ``False``. The default is
+``True``.
+
+For uncompressed data, it is controlled by setting
+:py:attr:`Options.uncompressed_fletcher32_filter` or passing
+``uncompressed_fletcher32_filter=X`` to :py:func:`write` and
+:py:func:`savemat` where ``X`` is ``True`` or ``False``. The default is
+``False``.
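+
+For example, a minimal sketch enabling checksums for uncompressed data
+as well as compressed data::
+
+   import numpy as np
+   import hdf5storage
+
+   data = np.random.rand(100)
+
+   # Checksum everything, whether or not it ends up compressed.
+   hdf5storage.write(data, path='/data', filename='data.h5',
+                     compressed_fletcher32_filter=True,
+                     uncompressed_fletcher32_filter=True)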
+
+
+.. note::
+   
+   Fletcher32 checksums are not computed for anything that is stored
+   as a :py:mod:`numpy` scalar.
+
+
+Further Reading
+===============
+
+.. seealso::
+
+   `HDF5 Datasets Filter pipeline <http://docs.h5py.org/en/latest/high/dataset.html#filter-pipeline>`_
+      Description of the Dataset filter pipeline in the
+      :py:mod:`h5py` documentation.
+   
+   `Using Compression in HDF5 <http://www.hdfgroup.org/HDF5/faq/compression.html>`_
+      FAQ on compression from the HDF Group.
+   
+   `SZIP License <https://www.hdfgroup.org/doc_resource/SZIP/Commercial_szip.html>`_
+      The license for using the SZIP compression algorithm.
+
+   `SZIP Compression in HDF Products <https://www.hdfgroup.org/doc_resource/SZIP>`_
+      Information on using SZIP compression from the HDF Group.
+
+   `3rd Party Compression Algorithms for HDF5 <https://www.hdfgroup.org/services/contributions.html>`_
+      List of common additional compression algorithms.
+   
diff --git a/doc/source/index.rst b/doc/source/index.rst
index a0c006e..ea07a16 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -13,6 +13,7 @@ Contents:
 
    information
    introduction
+   compression
    storage_format
    development
    api

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/python-hdf5storage.git


