MassComp, a lossless compressor for mass spectrometry data

Ruochen Yang, Xi Chen, Idoia Ochoa

doi:10.1186/s12859-019-2962-7

Open Access

Abstract

Background: Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. As a result, the amount of MS data has significantly increased in recent years. For example, the MS repository MassIVE contains more than 123 TB of data. Somewhat surprisingly, these data are stored uncompressed, hence incurring a significant storage cost. Efficient representation of these data is therefore paramount to lessen the burden of storage and facilitate their dissemination.

Results: We present MassComp, a lossless compressor optimized for the numerical (m/z)-intensity pairs that account for most of the MS data. We tested MassComp on several MS datasets and show that it delivers on average a 46% reduction in the size of the numerical data, and up to 89%. These results correspond to an average improvement of more than 27% when compared to the general compressor gzip and of 40% when compared to the state-of-the-art numerical compressor FPC. When tested on entire files retrieved from the MassIVE repository, MassComp achieves on average a 59% size reduction. MassComp is written in C++ and freely available at https://github.com/iochoa/MassComp.

Conclusions: The compression performance of MassComp demonstrates its potential to significantly reduce the footprint of MS data, and shows the benefits of designing specialized compression algorithms tailored to MS data. MassComp is an addition to the family of omics compression algorithms designed to lessen the storage burden and facilitate the exchange and dissemination of omics data.

Highlights

Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses.
Mass spectrometry (MS) files are stored uncompressed. We compare the performance of MassComp to that of the general lossless compressor gzip, the state-of-the-art numerical compressor FPC [20], and the family of numerical compressors MS-Numpress [16]. gzip was chosen as the baseline over other general lossless compressors because it is used in practice as the de facto compressor for other omics data, such as genomics.
The MS repository MassIVE contains more than 123 TB of data.

Summary

Introduction

Mass Spectrometry (MS) is a widely used technique in biology research, and has become key in proteomics and metabolomics analyses. The MS repository MassIVE contains more than 123 TB of data. Somewhat surprisingly, these data are stored uncompressed, incurring a significant storage cost. The field of metabolomics, which aims at the comprehensive and quantitative analysis of wide arrays of metabolites in biological samples, is developing thanks to advances in MS technology [3]. To facilitate the exchange and dissemination of these data, several centralized data repositories have been created that make the data and results accessible to researchers and biologists alike. Examples of such repositories include GPMDB (Global Proteome Machine Database) [8], PeptideAtlas/PASSEL [9, 10], PRIDE [11, 12] and MassIVE (Mass Spectrometry Interactive Virtual Environment) [13]. MassIVE contains more than 2 million files totaling 123 TB of storage, and PRIDE contains around 7000 projects and 74,000 assays.

Journal: BMC Bioinformatics
Publication Date: Jul 1, 2019
Citations: 9
License type: open-access