Abstract
The open XML format mzML, used for representation of MS data, is pivotal for the development of platform-independent MS analysis software. Although conversion from vendor formats to mzML must take place on a platform on which the vendor libraries are available (i.e. Windows), once mzML files have been generated, they can be used on any platform. However, the mzML format has turned out to be less efficient than vendor formats. In many cases, the naïve mzML representation is fourfold or even up to 18-fold larger than the original vendor file. In disk-I/O-limited setups, a larger data file also leads to longer processing times, which is a problem given the data production rates of modern mass spectrometers. To reduce this problem, we here present a family of numerical compression algorithms called MS-Numpress, intended for efficient compression of MS data. To ease adoption, the algorithms target the binary data in the mzML standard, and support in the main proteomics tools is already available. Using a test set of 10 representative MS data files, we demonstrate typical file size decreases of 90% when combined with traditional compression, as well as read time decreases of up to 50%. It is envisaged that these improvements will be beneficial for data handling within the MS community.
Highlights
Open XML formats for representation of MS data have been developed by the proteomics community to facilitate exchange and vendor-neutral analysis of mass spectrometry data
A raw data file from a data-independent acquisition experiment using an AB SCIEX TripleTOF resulted in a vendor format data file of 2.5 GB
To efficiently compress the three main types of binary data present in mzML files, namely i) mass-to-charge ratios, ii) ion counts and iii) retention times, we have developed three new near-lossless compression algorithms, while ensuring for each data type that precision losses are well below the precision of the most advanced mass spectrometers of today
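To make the idea of near-lossless numerical compression concrete, the following is a minimal Python sketch. It is not the MS-Numpress implementation and the abstract does not specify the algorithms at this level of detail. It assumes a generic approach: round each value to a fixed-point integer, store the differences between consecutive integers, and then apply traditional compression (zlib here). The function names and the `scale` parameter are hypothetical and chosen only for illustration.

```python
import struct
import zlib


def encode_fixed_point_delta(values, scale):
    """Hypothetical encoder: fixed-point rounding followed by delta coding.

    Each value is rounded to the nearest multiple of 1/scale, so the
    absolute rounding error is bounded by 0.5/scale (near-lossless).
    """
    if not values:
        return b""
    ints = [round(v * scale) for v in values]
    # Keep the first integer as-is, then store successive differences.
    deltas = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    return struct.pack(f"<{len(deltas)}q", *deltas)


def decode_fixed_point_delta(payload, scale):
    """Hypothetical decoder: undo the delta coding and rescale to floats."""
    deltas = struct.unpack(f"<{len(payload) // 8}q", payload)
    decoded = []
    running = 0
    for d in deltas:
        running += d
        decoded.append(running / scale)
    return decoded


# Synthetic, slowly increasing values standing in for an m/z array
# (assumed data, not taken from the paper's test set).
mz = [400.0 + 0.0123 * i for i in range(1000)]
scale = 1e5

raw = struct.pack(f"<{len(mz)}d", *mz)
encoded = encode_fixed_point_delta(mz, scale)
restored = decode_fixed_point_delta(encoded, scale)

max_error = max(abs(a - b) for a, b in zip(mz, restored))
size_plain = len(zlib.compress(raw))
size_numeric = len(zlib.compress(encoded))
print(f"max error: {max_error:.2e}")
print(f"zlib only: {size_plain} bytes, fixed-point delta + zlib: {size_numeric} bytes")
```

The design point this illustrates is that a numerical transform can turn smoothly varying floating-point data into small, repetitive integers, which a general-purpose compressor then handles far better than raw doubles. The choice of `scale` sets the trade-off between precision loss and compressibility.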
Summary
Open XML formats for representation of MS data have been developed by the proteomics community to facilitate exchange and vendor-neutral analysis of mass spectrometry data. The mzML format has been adopted widely by the proteomics community and is supported by many data processing tools. No measurements are provided on the compression time, and the algorithms are benchmarked on a very small set of data files.