Abstract

Modern high-performance computing (HPC) tasks overwhelm conventional geophysical data formats. We describe a new data schema called HDF5eis (read H-D-F-size) for handling big multidimensional time series data from environmental sensors in HPC applications, and we implement a freely available Python application programming interface (API) for building and processing HDF5eis files. HDF5eis augments the popular Hierarchical Data Format 5 (HDF5) with a minimal set of additional conventions that enable fast and flexible input and output of data that are regularly sampled in time and have any number of dimensions. HDF5eis supports storage of arbitrary ancillary data (e.g., metadata) in columnar format or as UTF-8 encoded byte streams alongside time series data. Our HDF5eis API provides simple and efficient access, through a single point of access, to big data sets distributed across a potentially large number of small heterogeneous files. HDF5eis outperforms conventional seismic data formats by up to two orders of magnitude in random read access time. We contribute HDF5eis as an operational tool and an experimental draft proposal that will help establish the next generation of data standards in the earth sciences.
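
To illustrate why regular sampling in time can make random read access cheap, the sketch below maps a requested time window directly to a range of sample indices using only a start time and a sampling rate, so no per-sample timestamps need to be stored or searched. This is a minimal illustration under stated assumptions, not the HDF5eis API: the function name `window_to_slice`, its parameters, and the in-memory list standing in for an on-disk array are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
import math


def window_to_slice(start_time, sampling_rate_hz, n_samples, t_begin, t_end):
    """Map the half-open time window [t_begin, t_end) to a slice of indices.

    Assumes sample i was recorded at start_time + i / sampling_rate_hz.
    (Hypothetical helper for illustration; not part of HDF5eis.)
    """
    def to_index(t):
        # Offset in samples; rounding guards against tiny floating-point error.
        offset = (t - start_time).total_seconds() * sampling_rate_hz
        return math.ceil(round(offset, 6))

    # Clamp both ends so windows outside the recording yield valid slices.
    first = min(max(to_index(t_begin), 0), n_samples)
    last = min(max(to_index(t_end), 0), n_samples)
    return slice(first, last)


# One minute of 100 Hz data; a plain list stands in for an on-disk array.
start = datetime(2020, 1, 1, tzinfo=timezone.utc)
samples = list(range(6000))

# Request the half second beginning 10 s after the start of the recording.
sl = window_to_slice(
    start, 100.0, len(samples),
    start + timedelta(seconds=10),
    start + timedelta(seconds=10.5),
)
window = samples[sl]
```

Because the index range follows from arithmetic alone, the cost of locating a window does not depend on the length of the recording. For multidimensional data, the same slice would be applied along whichever axis represents time.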

