vendredi 8 mai 2015

'/' in names in HDF5 files confusion

I am experiencing some really weird interactions between h5py, PyTables (via Pandas), and C++ generated HDF5 files. It seems that, h5check and h5py seem to cope with type names containing '/' but pandas/PyTables cannot. Clearly, there is a gap in my understanding, so:

What have I not understood here?


The gory details

I have the following data in a HDF5 file:

   [...]
   DATASET "log" {
      DATATYPE  H5T_COMPOUND {
         H5T_COMPOUND {
            H5T_STD_U32LE "sec";
            H5T_STD_U32LE "usec";
         } "time";
         H5T_IEEE_F32LE "CIF/align/aft_port_end/extend_pressure";
         [...]

This was created via the C++ API. The h5check utility says the file is valid.

Note that CIF/align/aft_port_end/extend_pressure is not meant as a path to a group/node/leaf. It is a label, that we use internally which happens to have some internal structure that contains '/' as delimiters. We do not want the HDF5 file to know anything about that: it should not care. Clearly, if '/' are illegal in any HDF5 name, then we have to change that delimiter to something else.

Using PyTables (okay, Pandas but it uses PyTables internally) to read the file, I get an

 >>> import pandas as pd
 >>> store = pd.HDFStore('data/XXX-20150423-071618.h5')
 >>> store
/home/XXX/virt/env/develop/lib/python2.7/site-packages/tables/group. py:1156: UserWarning: problems loading leaf ``/log``::

  the ``/`` character is not allowed in object names: 'XXX/align/aft_port_end/extend_pressure'

The leaf will become an ``UnImplemented`` node. 

I asked about this in this question and got told that '/' are illegal in the specification. However, things get stranger with h5py...

Using h5py to read the file, I get what I want:

>>> f['/log'].dtype
>>> dtype([('time', [('sec', '<u4'), ('usec', '<u4')]), ('CI
F/align/aft_port_end/extend_pressure', '<f4')[...]

Which is more or less what I set out with.

Needless to say, I am confused. Have I managed to create an illegal HDF5 file that somehow passes h5check? Is PyTables not supporting this edge case? ... I am confused.


Clearly, I could write a simple wrapper something like this:

>>> import matplotlib.pyplot as plt
>>> silly = pd.DataFrame(f['/log']['CIF/align/aft_port_end/extend_pressure'])
>>> silly.plot()
>>> plt.show()

to get all the data from the HDF5 file into Pandas. However, I am not sure if this is a good idea because of the confusion earlier. My biggest worry is the conversion might not scale if the data is very large...

Aucun commentaire:

Enregistrer un commentaire