[python-hdf5storage] 103/152: Added documentation on the storage format.
Ghislain Vaillant
ghisvail-guest at moszumanska.debian.org
Mon Feb 29 08:24:39 UTC 2016
This is an automated email from the git hooks/post-receive script.
ghisvail-guest pushed a commit to annotated tag 0.1
in repository python-hdf5storage.
commit 1b70ac1da7fd8abdae9864b3658117b52860476a
Author: Freja Nordsiek <fnordsie at gmail.com>
Date: Fri Feb 7 02:33:41 2014 -0500
Added documentation on the storage format.
---
doc/source/index.rst | 1 +
doc/source/storage_format.rst | 340 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 341 insertions(+)
diff --git a/doc/source/index.rst b/doc/source/index.rst
index 72078ed..700f171 100644
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@@ -13,6 +13,7 @@ Contents:
information
introduction
+ storage_format
api
Indices and tables
diff --git a/doc/source/storage_format.rst b/doc/source/storage_format.rst
new file mode 100644
index 0000000..7a13946
--- /dev/null
+++ b/doc/source/storage_format.rst
@@ -0,0 +1,340 @@
+.. currentmodule:: hdf5storage
+
+==============
+Storage Format
+==============
+
+This package adopts certain conventions for the conversion and storage
+of Python datatypes and the metadata that is written with them. Then, to
+make the data MATLAB MAT file compatible, additional metadata must be
+written. This page assumes that one has imported collections and numpy
+as ::
+
+ import collections as cl
+ import numpy as np
+
+
+MATLAB File Header
+==================
+
+In order for a file to be MATLAB v7.3 MAT file compatible, it must have
+a properly formatted file header, or userblock in HDF5 terms. The file
+must have a 512 byte userblock, of which 128 bytes are used. The 128
+bytes consists of a 116 byte string (spaces pad the end) followed by a
+specific 12 byte sequence (magic number). On MATLAB, the 116 byte string, depending on the computer system and the date, looks like ::
+
+ 'MATLAB 7.3 MAT-file, Platform: GLNXA64, Created on: Fri Feb 07 02:29:00 2014 HDF5 schema 1.00 .'
+
+This package just changes the Platform part to ::
+
+ 'CPython A.B.C'
+
+Where A, B, and C are the major, minor, and micro version numbers of the Python interpreter (e.g. 3.3.0).
+
+The 12 byte sequence, in hexidecimal is ::
+
+ 00000000 00000000 0002494D
+
+
+How Data Is Stored
+==================
+
+All data is stored either as a Dataset or as a Group. Most non-Numpy
+types must be converted to a Numpy type before they are written, and
+some Numpy types must be converted to other ones before being
+written. The table below lists how every supported Python datatype is
+stored (Group or Dataset), what type/s it is converted to (no conversion
+if none are listed), as well as the first version of this package to
+support the datatype.
+
+============= ======= ============================ ================
+Type Version Converted to Group or Dataset
+============= ======= ============================ ================
+bool 0.1 np.bool\_ or np.uint8 [1]_ Dataset
+None 0.1 ``np.float64([])`` Dataset
+int 0.1 np.int64 Dataset
+float 0.1 np.float64 Dataset
+complex 0.1 np.complex128 Dataset
+str 0.1 np.uint32/16 [2]_ Dataset
+bytes 0.1 np.bytes\_ or np.uint16 [3]_ Dataset
+bytearray 0.1 np.bytes\_ or np.uint16 [3]_ Dataset
+list 0.1 np.object\_ Dataset
+tuple 0.1 np.object\_ Dataset
+set 0.1 np.object\_ Dataset
+frozenset 0.1 np.object\_ Dataset
+cl.deque 0.1 np.object\_ Dataset
+dict [4]_ 0.1 Group
+np.bool\_ 0.1 not or np.uint8 [1]_ Dataset
+np.uint8 0.1 Dataset
+np.uint16 0.1 Dataset
+np.uint32 0.1 Dataset
+np.uint64 0.1 Dataset
+np.uint8 0.1 Dataset
+np.int16 0.1 Dataset
+np.int32 0.1 Dataset
+np.int64 0.1 Dataset
+np.float16 0.1 Dataset
+np.float32 0.1 Dataset
+np.float64 0.1 Dataset
+np.complex64 0.1 Dataset
+np.complex128 0.1 Dataset
+np.str\_ 0.1 np.uint32/16 [2]_ Dataset
+np.bytes\_ 0.1 np.bytes\_ or np.uint16 [3]_ Dataset
+np.object\_ 0.1 Dataset
+============= ======= ============================ ================
+
+.. [1] Depends on the selected options. Always ``np.uint8`` when
+ ``convert_bools_to_uint8 == True`` (set implicitly when
+ ``matlab_compatible == True``).
+.. [2] Depends on the selected options and whether it can be converted
+ to UTF-16 without using doublets. If
+ ``convert_numpy_str_to_utf16 == True`` (set implicitly when
+ ``matlab_compatible == True``) and it can be converted to UTF-16
+ without losing any characters that can't be represented in UTF-16
+ or using UTF-16 doublets (MATLAB doesn't support them), then it
+ is written as ``np.uint16`` in UTF-16 encoding. Otherwise, it is
+ stored at ``np.uint32`` in UTF-32 encoding.
+.. [3] Depends on the selected options. If
+ ``convert_numpy_bytes_to_utf16 == True`` (set implicitly when
+ ``matlab_compatible == True``), it will be stored as
+ ``np.uint16`` in UTF-16 encoding. Otherwise, it is just written
+ as ``np.bytes_``.
+.. [4] All keys must be ``str``.
+
+
+Attributes
+==========
+
+Many different HDF5 Attributes are set for each object written if the
+:py:attr:`Options.store_type_information` and/or
+:py:attr:`Options.matlab_compatible` options are set. The attributes
+associated with each will be referred to as "Python Attributes" and
+"MATLAB Attributes" respectively. If neither of them are set, then no
+Attributes are used. The table below lists the Attributes that have
+definite values depending only on the particular Python datatype being
+stored. Then, the other attributes are detailed individually.
+
+.. note
+
+ 'Python.Type', 'Python.numpy.UnderlyingType', and 'MATLAB_class' are
+ all ``np.str_``. 'MATLAB_int_decode' is a ``np.int64``.
+
+============= =================== =========================== ================== =================
+ Python Attributes MATLAB Attributes
+ ------------------------------------------------ -------------------------------------
+Type Python.Type Python.numpy.UnderlyingType MATLAB_class MATLAB_int_decode
+============= =================== =========================== ================== =================
+bool 'bool' 'bool' 'logical' 1
+None 'builtins.NoneType' 'float64' 'double'
+int 'int' 'int64' 'int64'
+float 'float' 'float64' 'double'
+complex 'complex' 'complex128' 'double'
+str 'str' 'str#' [5]_ 'char' 2
+bytes 'bytes' 'bytes#' [5]_ 'char' 2
+bytearray 'bytearray' 'bytes#' [5]_ 'char' 2
+list 'list' 'object' 'cell'
+tuple 'tuple' 'object' 'cell'
+set 'set' 'object' 'cell'
+frozenset 'frozenset' 'object' 'cell'
+cl.deque 'collections.deque' 'object' 'cell'
+dict 'dict' 'struct'
+np.bool\_ 'numpy.bool' 'bool' 'logical' 1
+np.uint8 'numpy.uint8' 'uint8' 'uint8'
+np.uint16 'numpy.uint16' 'uint16' 'uint16'
+np.uint32 'numpy.uint32' 'uint32' 'uint32'
+np.uint64 'numpy.uint64' 'uint64' 'uint64'
+np.uint8 'numpy.int8' 'int8' 'int8'
+np.int16 'numpy.int16' 'int16' 'int16'
+np.int32 'numpy.int32' 'int32' 'int32'
+np.int64 'numpy.int64' 'int64' 'int64'
+np.float16 'numpy.float16' 'float16'
+np.float32 'numpy.float32' 'float32' 'single'
+np.float64 'numpy.float64' 'float64' 'double'
+np.complex64 'numpy.complex64' 'complex64' 'single'
+np.complex128 'numpy.complex128' 'complex128' 'double'
+np.str\_ 'numpy.str\_' 'str#' [5]_ 'char' or 'uint32' 2 or 4 [6]_
+np.bytes\_ 'numpy.bytes\_' 'bytes#' [5]_ 'char' 2
+np.object\_ 'numpy.object\_' 'object' 'cell'
+============= =================== =========================== ================== =================
+
+.. [5] '#' is replaced by the number of bits taken up by the string, or
+ each string in the case that it is an array of strings. This is 8
+ and 32 bits per character for ``np.bytes_`` and ``np.str_``
+ respectively.
+.. [6] ``2`` if it is stored as ``np.uint16`` or ``4`` if ``np.uint32``.
+
+
+Python.Shape
+------------
+
+Python Attribute
+
+``np.ndarray(dtype='uint64')``
+
+Every Python datatype that is or ends up being converted to a Numpy
+datatype has a shape attribute, which is stored in this Attribute. This
+holds the shape before any conversions of arrays to at least 2D, array
+transposes, or conversions of strings to unsigned integer types.
+
+Python.numpy.Container
+----------------------
+
+Python Attribute
+
+{'scalar', 'ndarray', 'matrix'}
+
+For Numpy types (or types converted to them), whether the type is a
+scalar (its type is something such as ``np.uint16``, ``np.str_``, etc.),
+some form of array (its type is ``np.ndarray``), or a matrix (type
+is ``np.matrix``) is stored in this Attribute.
+
+Python.Empty and MATLAB_empty
+-----------------------------
+
+Python and MATLAB Attributes respectively
+
+``np.uint8``
+
+If the datatype being stored has zero elements, then this Attribute is
+set to ``1``. Otherwise, the Attribute is deleted. For Numpy types (or
+those converted to them), the shape after conversions to at least 2D,
+array transposes, and conversions of strings to unsigned integer types
+is stored in place of the data as an array of ``np.uint64`` if
+:py:attr:`Options.store_shape_for_empty` is set (set implicitly if the
+`matlab_compatible` option is set).
+
+H5PATH
+------
+
+MATLAB Attribute
+
+``np.str_``
+
+For every object that is stored inside a Group other than the root of
+the HDF5 file (``'/'``), the path to the object is stored in this
+Attribute. MATLAB does not seem to require this Attribute to be there,
+though it does set it in the files it produces.
+
+MATLAB_fields
+-------------
+
+MATLAB Attribute
+
+complicated array of string arrays not supported by h5py
+
+For MATLAB structures, MATLAB sets this field to all of the field names
+of the structure. If this Attribute is missing, MATLAB does not seem to
+care. Trying to set it to a differently formatted array of strings that
+the h5py package can handle causes an error in MATLAB when the file is
+imported, so this package does not set this Attribute at all.
+
+
+Storage of Special Types
+========================
+
+Complex numbers and ``np.object_`` arrays (and things converted to them)
+have to be stored in a special fashion.
+
+Since HDF5 has no builtin complex type, complex numbers are stored as an
+HDF5 COMPOUND type with different fieldnames for the real and imaginary
+partd like many other pieces of software (including MATLAB)
+do. Unfortunately, there is not a standardized pair of field names. h5py
+by default uses 'r' and 'i' for the real and imaginary parts. MATLAB
+uses 'real' and 'imag' instead. The :py:attr:`Options.complex_names`
+option controls the field names (given as a tuple in real, imaginary
+order) that are used for complex numbers as they are written. It is set
+automatically to ``('real', 'imag')`` when
+``matlab_compatible == True``. When reading data, this package
+automatically checks numeric types for many combinations of reasonably
+expected field names to find complex types.
+
+When storing ``np.object_`` arrays, the individual elements are stored
+elsewhere and then an array of HDF5 Object References to their storage
+locations is written as the data object. The elements are all written to
+the Group path set by :py:attr:`Options.group_for_references` with a
+randomized name (this package keeps generating randomized names till an
+available one is found). It must be ``'/#refs#'`` for MATLAB (setting
+``matlab_compatible`` sets this automatically).
+
+
+Optional Data Transformations
+=============================
+
+Many different data conversions beyond turning most non-Numpy types into
+Numpy types, can be done and are controlled by individual settings in
+the :py:class:`Options` class (most are set to fixed values when
+``matlab_compatible == True``). The transfomations are listed below by
+their option name.
+
+delete_unused_variables
+-----------------------
+
+``bool``
+
+Whether any variable names in something that would be stored as an HDF5
+Group (would end up a struct in MATLAB) that currently exist in the file
+but are not in the object being stored should be deleted on the file or
+not.
+
+make_at_least_2d
+----------------
+
+``bool``
+
+Whether all Numpy types (or things converted to them) should be made
+into arrays of 2 dimensions if they have less than that or not. This
+option is set to ``True`` implicitly by ``matlab_compatible``.
+
+convert_numpy_bytes_to_utf16
+----------------------------
+
+``bool``
+
+Whether all ``np.bytes_`` strings (or things converted to it) should be
+converted to UTF-16 and written as an array of ``np.uint16`` or not. This
+option is set to ``True`` implicitly by ``matlab_compatible``.
+
+convert_numpy_str_to_utf16
+--------------------------
+
+``bool``
+
+Whether all ``np.str_`` strings (or things converted to it) should be
+converted to UTF-16 and written as an array of ``np.uint16`` if the
+strings use no characters outside of the UTF-16 set and the conversion
+does not result in any UTF-16 doublets or not. This option is set to
+``True`` implicitly by ``matlab_compatible``.
+
+convert_bools_to_uint8
+----------------------
+
+``bool``
+
+Whether the ``np.bool_`` type (or things converted to it) should be
+converted to ``np.uint8`` (``True`` becomes ``1`` and ``False`` becomes
+``0``) or not. If not, then the h5py default of an enum type that is not
+MATLAB compatible is used. This option is set to ``True`` implicitly by
+``matlab_compatible``.
+
+reverse_dimension_order
+-----------------------
+
+``bool``
+
+Whether the dimension order of all arrays should be reversed
+(essentially a transpose) or not before writing to the file. This option
+is set to ``True`` implicitly by ``matlab_compatible``. This option
+needs to be set if one wants an array to end up the same shape when
+imported into MATLAB. This option is necessary because Numpy and MATLAB
+use opposite dimension ordering schemes, which are C and Fortan schemes
+respectively. 2D arrays are stored by row in the C scheme and column in
+the Fortan scheme.
+
+store_shape_for_empty
+---------------------
+
+``bool``
+
+Whether, for empty arrays, to store the shape of the array (after
+transformations) as the Dataset for the object. This option is set to
+``True`` implicitly by ``matlab_compatible``.
--
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-science/packages/python-hdf5storage.git
More information about the debian-science-commits
mailing list