The arc of Mass Spectrometry Exchange Formats is long, but it bends toward HDF5.

Manor Askenazi, Johannes Graumann, Hisham Ben Hamidane

doi:10.1002/mas.21522


Open Access

https://doi.org/10.1002/mas.21522


Abstract

The evolution of data exchange in Mass Spectrometry spans decades and has ranged from human‐readable text files representing individual scans or collections thereof (McDonald et al., 2004) through the official standard XML‐based (Harold, Means, & Udemadu, 2005) data interchange standard (Deutsch, 2012), to increasingly compressed (Teleman et al., 2014) variants of this standard sometimes requiring purely binary adjunct files (Römpp et al., 2011). While the desire to maintain even partial human readability is understandable, the inherent mismatch between XML's textual and irregular format relative to the numeric and highly regular nature of actual spectral data, along with the explosive growth in dataset scales and the resulting need for efficient (binary and indexed) access has led to a phenomenon referred to as “technical drift” (Davis, 2013). While the drift is being continuously corrected using adjunct formats, compression schemes, and programs (Röst et al., 2015), we propose that the future of Mass Spectrometry Exchange Formats lies in the continued reliance and development of the PSI‐MS (Mayer et al., 2014) controlled vocabulary, along with an expedited shift to an alternative, thriving and well‐supported ecosystem for scientific data‐exchange, storage, and access in binary form, namely that of HDF5 (Koranne, 2011). Indeed, pioneering efforts to leverage this universal, binary, and hierarchical data‐format have already been published (Wilhelm et al., 2012; Rübel et al., 2013) though they have under‐utilized self‐description, a key property shared by HDF5 and XML. We demonstrate that a straightforward usage of plain (“vanilla”) HDF5 yields immediate returns including, but not limited to, highly efficient data access, platform independent data viewers, a variety of libraries (Collette, 2014) for data retrieval and manipulation in many programming languages and remote data access through comprehensive RESTful data‐servers. © 2016 The Authors. 
Mass Spectrometry Reviews published by Wiley Periodicals, Inc. Mass Spec Rev 36:668–673, 2017

Highlights

As a general rule, mass spectrometers produce output stored in manufacturer-specific proprietary file formats.
Beyond the inherently reduced accessibility of proprietary binary files, the formats and tools used are often burdened with requirements for backward compatibility bridging decades, as well as software dependencies precluding the use of vendor-provided tools on UNIX-like systems central to many high-performance data analytic pipelines. In response to this situation, the proteomics community set about defining a controlled vocabulary for mass spectrometry data as well as a standard format for data interchange. It settled on PSI-MS for the controlled vocabulary: expressed as a 17,000-line OBO v1.2 file (Open Biomedical Ontologies [Smith et al., 2007]), this extremely detailed and increasingly comprehensive formal standard defines and inter-relates most key concepts in the field of mass spectrometry and mass-spectrometry-based proteomics (e.g., MS:1000628 is the formal accession for a “basepeak chromatogram”, which is a kind of MS:1000810, i.e., “mass chromatogram”).
Following the adoption of mzML by the mass spectrometry community, an ever-widening gulf has emerged between the performance characteristics of XML and the requirements of sharing and efficiently accessing data sets: software systems built around the XML-based format suffer from a mismatch between the data being stored and the storage format being used, with a resulting penalty in file size and performance.
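
The "is a kind of" relationship between accessions mentioned in the highlights can be made concrete with a small sketch. The two stanzas below are hand-written in OBO v1.2 syntax using the accessions and names quoted above; they are illustrative and not copied from the actual PSI-MS file, and the helper names `parse_terms` and `ancestors` are hypothetical.

```python
# Illustrative stanzas in OBO v1.2 syntax (hand-written, not the real PSI-MS file).
OBO_SNIPPET = """\
[Term]
id: MS:1000810
name: mass chromatogram

[Term]
id: MS:1000628
name: basepeak chromatogram
is_a: MS:1000810 ! mass chromatogram
"""


def parse_terms(text):
    """Return {accession: {"name": str, "is_a": [accessions]}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are collected.
            current = {"name": None, "is_a": []} if line == "[Term]" else None
            continue
        if current is None or ":" not in line:
            continue
        tag, _, value = line.partition(":")
        value = value.strip()
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            # Drop the trailing "! comment" that repeats the parent's name.
            current["is_a"].append(value.split("!")[0].strip())
    return terms


def ancestors(terms, accession):
    """Follow is_a links upward and return every ancestor accession found."""
    seen = []
    stack = list(terms[accession]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            stack.extend(terms.get(parent, {"is_a": []})["is_a"])
    return seen


terms = parse_terms(OBO_SNIPPET)
print(terms["MS:1000628"]["name"], "is_a", ancestors(terms, "MS:1000628"))
```

A real parser would also need to handle the other stanza types, tags, and escaping rules in the OBO specification; this sketch covers only `id`, `name`, and `is_a`.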

Summary

INTRODUCTION

Mass spectrometers produce output stored in manufacturer-specific proprietary file formats. A combination of the need to efficiently store ever larger data sets, regulatory [...] In response to this situation, the proteomics community set about defining a controlled vocabulary for mass spectrometry data as well as a standard format for data interchange. The official data interchange format selected by the community, namely mzML (Deutsch, 2008), does provide this mapping, and does so in every instance (i.e., in every individual mzML file). This is because the PSI community wanted the standard to be self-describing, universally supported, and human readable. These requirements led to the choice of XML as the underlying representation technology for the mzML format.
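
The choice of XML has a concrete cost for numeric data: binary arrays have to be made text-safe before they can be embedded in a document, and mzML does this with base64 encoding. The sketch below compares the size of one array in raw binary, base64, and plain decimal text. The array is synthetic (1000 made-up values, not real spectral data) and the variable names are illustrative assumptions.

```python
import base64
import struct

# Hypothetical intensity array: 1000 synthetic values, not real spectral data.
values = [float(i) * 0.5 for i in range(1000)]

# Raw binary: 8 bytes per little-endian float64.
raw = struct.pack("<%dd" % len(values), *values)

# Text-safe form needed to embed the same bytes in an XML document.
# base64 emits 4 output characters for every 3 input bytes.
encoded = base64.b64encode(raw)

# Plain decimal text, one whitespace-separated token per number.
as_text = " ".join(repr(v) for v in values).encode("ascii")

print("raw bytes:   ", len(raw))
print("base64 chars:", len(encoded))
print("decimal text:", len(as_text))

# Round trip: decoding the base64 text recovers the original values exactly.
decoded = list(struct.unpack("<%dd" % len(values), base64.b64decode(encoded)))
```

The base64 form is about a third larger than the raw bytes before any compression is applied, and a reader must decode it before doing arithmetic on it.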

SCALING BEYOND HUMAN READABILITY

HDF5 IS A NATURAL CHOICE FOR LARGE SCIENTIFIC DATASETS

OpenMSI AND mz5—HDF5 PIONEERS IN MASS SPECTROMETRY

HDF5 SUPPORTS THE EVOLVING FUNCTION OF DATA INTERCHANGE FORMATS

PROOF OF CONCEPT IMPLEMENTATION

ACCESSING HDF5 DATA WITH STANDARD AND CUSTOM APIs

CONCLUSION

Methods


Journal: Mass Spectrometry Reviews
Publication Date: Oct 14, 2016
Citations: 10
License type: CC BY 4.0
