mzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements.

Ranjeet S Bhamber, Eric W Deutsch, Andris Jankevics, Andrew R Jones, Andrew W Dowsey

doi:10.1021/acs.jproteome.0c00192

Abstract

With ever-increasing amounts of data produced by mass spectrometry (MS) proteomics and metabolomics, and the sheer volume of samples now analyzed, the need for a common open format possessing both file size efficiency and faster read/write speeds has become paramount to drive the next generation of data analysis pipelines. The Proteomics Standards Initiative (PSI) has established a clear and precise extensible markup language (XML) representation for data interchange, mzML, receiving substantial uptake; nevertheless, storage and file access efficiency has not been the main focus. We propose an HDF5 file format “mzMLb” that is optimized for both read/write speed and storage of the raw mass spectrometry data. We provide an extensive validation of the write speed, random read speed, and storage size, demonstrating a flexible format that with or without compression is faster than all existing approaches in virtually all cases, while with compression is comparable in size to proprietary vendor file formats. Since our approach uniquely preserves the XML encoding of the metadata, the format implicitly supports future versions of mzML and is straightforward to implement: mzMLb’s design adheres to both HDF5 and NetCDF4 standard implementations, which allows it to be easily utilized by third parties due to their widespread programming language support. A reference implementation within the established ProteoWizard toolkit is provided.

Highlights

Through an extensive industry-wide collaborative process, in 2008 the Proteomics Standards Initiative (PSI) established a standardized Extensible Markup Language (XML) representation for raw data interchange in mass spectrometry (MS),[1] “mzML,” building upon concepts defined in the earlier formats mzData and mzXML.[2] mzML is the pervasive format for interchange and deposition of raw MS proteomics and metabolomics data.[3]
Two data types are contained within raw MS data sets: (a) numeric data, i.e., mass-to-charge ratios and spectral/chromatographic intensities; and (b) metadata related to instrument and experimental settings. mzML encodes these data types within a rich, schema-linked XML file, where the metadata is accurately and unambiguously annotated using the PSI-MS controlled vocabulary (CV).[4]
We demonstrate that, by using a hybrid file format that stores XML metadata together with native binary data within an HDF5 file, it is possible to improve the reading/writing speed of raw MS data while preserving all related metadata in PSI-compliant mzML in an implicitly future-proof way.
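The storage argument behind keeping numeric data in native binary form can be illustrated with a small sketch. mzML embeds its numeric arrays in the XML as base64 text, which expands every 3 bytes into 4 characters, whereas a binary container can hold the same bytes directly. The snippet below uses only the Python standard library; the m/z values are hypothetical, and it does not use HDF5 or reproduce the mzMLb layout, it only shows the size difference between the two encodings of one uncompressed array.

```python
import base64
import struct

# Hypothetical m/z values for illustration only; not taken from the paper.
mz_values = [100.0 + 0.01 * i for i in range(1000)]

# Native binary form: 8 bytes per little-endian 64-bit float.
native = struct.pack(f"<{len(mz_values)}d", *mz_values)

# Text-safe form suitable for embedding in XML: base64 maps 3 bytes to 4 characters.
encoded = base64.b64encode(native)

# The round trip is exact, but a reader has to decode the text before using it.
decoded = list(struct.unpack(f"<{len(mz_values)}d", base64.b64decode(encoded)))

native_size = len(native)
encoded_size = len(encoded)
overhead = encoded_size / native_size

print(f"native: {native_size} bytes, base64: {encoded_size} bytes, ratio: {overhead:.3f}")
```

The roughly one-third size increase applies before any compression, and the decode step is extra work on every read; both are costs that a native binary dataset avoids.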

Summary

Introduction

In 2008, the Proteomics Standards Initiative (PSI) established a standardized Extensible Markup Language (XML) representation for raw data interchange in mass spectrometry (MS),[1] “mzML,” building upon concepts defined in the earlier formats mzData and mzXML.[2] mzML is the pervasive format for interchange and deposition of raw MS proteomics and metabolomics data.[3] To provide a detailed, flexible, consistent, and simple standard for the sharing of raw MS data, it was designed around a generic ontology for its representation, at the cost of inefficient storage and file access. The first approach to address the performance and file size issues of mzML was mz5.[6] At the core of mz5 is HDF5[9] (Hierarchical Data Format version 5), originally developed by the National Center for Supercomputing Applications (NCSA) for the storage and organization of large amounts of data. The two primary objects represented in HDF5 files are “groups” and “data sets.” Groups are

Received: March 24, 2020. Published: August 31, 2020.

Methods

Results

Conclusion