Abstract

Natural sciences generate an increasing amount of data in a wide range of formats developed by different research groups and commercial companies. At the same time there is a growing desire to share data along with publications in order to enable reproducible research. Open formats have publicly available specifications, which facilitate data sharing and reproducible research. Hierarchical Data Format 5 (HDF5) is a popular open format widely used in neuroscience, often as a foundation for other, more specialized formats. However, drawbacks related to HDF5's complex specification have prompted discussion of an improved replacement. We propose a novel alternative, the Experimental Directory Structure (Exdir), an open specification for data storage in experimental pipelines which addresses drawbacks associated with HDF5 while retaining its advantages. HDF5 stores data and metadata in a hierarchy within a complex binary file which, among other things, is not human-readable, is not optimal for version control systems, and lacks support for easy access to raw data from external applications. Exdir, on the other hand, uses file system directories to represent the hierarchy, with metadata stored in human-readable YAML files, datasets stored in binary NumPy files, and raw data stored directly in subdirectories. Furthermore, storing data in multiple files makes it easier for version control systems to track changes. Exdir is not a file format in itself, but a specification for organizing files in a directory structure. Exdir uses the same abstractions as HDF5 and is compatible with the HDF5 Data Model. Several research groups are already using data stored in a directory hierarchy as an alternative to HDF5, but no common standard exists. This complicates data sharing and limits the development of common tools for reading, writing, and analyzing data.
Exdir facilitates improved data storage, data sharing, reproducible research, and novel insight from interdisciplinary collaboration. With the publication of Exdir, we invite the scientific community to join the development to create an open specification that will serve as many needs as possible and act as a foundation for open access to and exchange of data.
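
The layout described in the abstract (directories for the hierarchy, YAML for metadata, NumPy files for datasets, raw data in subdirectories) can be sketched in a few lines of Python. This is a minimal illustration, not the Exdir reference implementation: the file names `attributes.yaml` and `data.npy`, the directory names, and the helpers `write_yaml` and `create_dataset` are assumptions chosen for the example, and the YAML writer only handles flat mappings of simple values.

```python
import tempfile
from pathlib import Path

import numpy as np


def write_yaml(path, mapping):
    """Write a flat mapping of simple values as YAML 'key: value' lines."""
    lines = [f"{key}: {value}" for key, value in mapping.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def create_dataset(group_dir, name, data, attributes):
    """Store an array as a directory holding a .npy file and a YAML sidecar."""
    dataset_dir = group_dir / name
    dataset_dir.mkdir(parents=True)
    np.save(dataset_dir / "data.npy", data)  # binary NumPy file
    write_yaml(dataset_dir / "attributes.yaml", attributes)  # readable metadata
    return dataset_dir


# Hypothetical experiment tree; all names below are illustrative only.
root = Path(tempfile.mkdtemp()) / "experiment.exdir"
session = root / "session_01"
session.mkdir(parents=True)
write_yaml(session / "attributes.yaml",
           {"subject": "rat_42", "sampling_rate_hz": 30000})

# A dataset is a directory containing the array and its metadata.
voltage = np.arange(6, dtype=np.float64).reshape(2, 3)
dataset_dir = create_dataset(session, "voltage", voltage, {"unit": "mV"})

# Raw acquisition files sit in an ordinary subdirectory, untouched.
raw_dir = session / "raw_acquisition"
raw_dir.mkdir()
(raw_dir / "recording.bin").write_bytes(b"\x00\x01\x02\x03")

# Any tool can read the pieces back without a dedicated library.
loaded = np.load(dataset_dir / "data.npy")
metadata_text = (dataset_dir / "attributes.yaml").read_text(encoding="utf-8")
```

Because every dataset and metadata file is a separate file on disk, a change to one attribute touches one small text file, which is the property the abstract points to when discussing version control.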

Highlights

  • Technology development is continuously driving science to new discoveries

  • We have summarized the limitations and challenges from Greenfield et al. (2015) that are most relevant for scientific use, along with some additional drawbacks, all of which are addressed by the Experimental Directory Structure (Exdir). The first is that metadata is stored in a binary format, which makes it unreadable without tools that read Hierarchical Data Format 5 (HDF5) files.

  • The Advanced Scientific Data Format (ASDF) does not provide a convenient way to store raw data in the internal hierarchy. Some specifications, such as the Brain Imaging Data Structure (BIDS) (Gorgolewski et al., 2016), approach the above problems by using the file system to define the data hierarchy, which is similar to the solution we propose with Exdir.

Summary

SIGNIFICANCE STATEMENT

An alternative storage solution that improves on certain drawbacks of Hierarchical Data Format 5 (HDF5) is to use directories in the file system to define a hierarchy, storing data in binary files and metadata in text files. While research groups can deploy this strategy in various ways, no common standard for such a storage solution exists. Experimental Directory Structure (Exdir) is a proposal to standardize this storage solution. We envision the establishment of such a standard and present Exdir to the community as a starting point.
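
To make the idea of "directories define the hierarchy" concrete, the sketch below walks a directory tree with the Python standard library alone and returns a nested dictionary that mirrors it. It is an illustration under stated assumptions, not part of Exdir: the function `build_hierarchy`, the `"_files"` key, and all directory and file names are hypothetical, and the `.npy` file is an empty placeholder rather than a valid array file.

```python
import tempfile
from pathlib import Path


def build_hierarchy(directory):
    """Mirror a directory tree as nested dicts.

    Subdirectories become nested dicts keyed by name; the files found
    directly in a directory are listed (sorted) under the "_files" key.
    """
    entries = sorted(directory.iterdir())
    node = {"_files": [p.name for p in entries if p.is_file()]}
    for child in entries:
        if child.is_dir():
            node[child.name] = build_hierarchy(child)
    return node


# Build a tiny, hypothetical tree to walk.
root = Path(tempfile.mkdtemp()) / "experiment"
(root / "session_01" / "voltage").mkdir(parents=True)
(root / "session_01" / "attributes.yaml").write_text(
    "subject: rat_42\n", encoding="utf-8")
# Placeholder bytes only; a real dataset would hold a valid NumPy file.
(root / "session_01" / "voltage" / "data.npy").write_bytes(b"")

hierarchy = build_hierarchy(root)
```

The nested keys (`session_01`, then `voltage`) play the role that group and dataset paths play in an HDF5 file, which is why no special reader is needed to discover the structure.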

INTRODUCTION
EXISTING ALTERNATIVES
Other Formats
Requirements of a New Specification
STANDARDS USED IN EXDIR
BASIC STRUCTURE OF EXDIR DIRECTORIES
Dataset
REFERENCE IMPLEMENTATION IN PYTHON
Overview of the Exdir API in Python
Exdir Plugins
Converting From Using HDF5 to Exdir
Reading and Writing to Exdir in Other Languages
Exdir Command Line Interface
Exdir Browser
PERFORMANCE
Findings
DISCUSSION
