Reproducibility of mass spectrometry based metabolomics data

Tusharkanti Ghosh, Katerina Kechris, Daisy Philtron, Weiming Zhang, Debashis Ghosh

doi:10.1186/s12859-021-04336-9

Abstract

Background: Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. We define a metabolite to be reproducible when it demonstrates consistency across replicate experiments. Similarly, metabolites that are not consistent across replicates can be labeled as irreproducible. In this work, we introduce and evaluate the use of (Ma)ximum (R)ank (R)eproducibility (MaRR) to examine reproducibility in mass spectrometry-based metabolomics experiments. We examine reproducibility across technical or biological samples in three different mass spectrometry metabolomics (MS-Metabolomics) data sets.

Results: We apply MaRR, a nonparametric approach that detects the change from reproducible to irreproducible signals using a maximal rank statistic. The advantage of MaRR over model-based methods is that it does not make parametric assumptions about the underlying distributions or dependence structures of reproducible metabolites. Using three MS-Metabolomics data sets generated in the multi-center Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPD) study, we applied the MaRR procedure after data processing to explore reproducibility across technical or biological samples. Under realistic settings of MS-Metabolomics data, the MaRR procedure effectively controls the False Discovery Rate (FDR) when there is a gradual reduction in correlation between replicate pairs for less highly ranked signals. Simulation studies also show that the MaRR procedure tends to have high power for detecting reproducible metabolites in most situations, except when the proportion of reproducible metabolites is small. Bias in the simulations (i.e., the difference between the estimated and the true proportion of reproducible signals) is also close to zero. The results from the real data show a higher level of reproducibility for technical replicates than for biological replicates across all three data sets. In summary, we demonstrate that the MaRR procedure can be adapted to various experimental designs, and that the nonparametric approach performs consistently well.

Conclusions: This research was motivated by concerns about reproducibility, which has proven to be a major obstacle in the use of genomic findings to advance clinical practice. In this paper, we developed a data-driven approach to assess the reproducibility of MS-Metabolomics data sets. The methods described in this paper are implemented in the open-source R package marr, which is freely available from Bioconductor at http://bioconductor.org/packages/marr.
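The simulation summaries named in the abstract (FDR, power, and bias) can be illustrated with a minimal Python sketch. This is not code from the marr package: the function `evaluate_calls` and its arguments are hypothetical names chosen for illustration, and the sketch assumes that the true reproducibility labels are known, as they are in a simulation. For a single simulated data set it reports the false discovery proportion (which averages to the FDR over many simulations), the power, and the bias as defined in the abstract.

```python
from typing import Dict, Sequence


def evaluate_calls(
    truth: Sequence[bool],
    called: Sequence[bool],
    estimated_proportion: float,
) -> Dict[str, float]:
    """Summarize one simulated data set (hypothetical helper, for illustration).

    truth[i]: True if metabolite i was simulated as reproducible.
    called[i]: True if the procedure declared metabolite i reproducible.
    estimated_proportion: the procedure's estimate of the reproducible fraction.
    """
    if len(truth) != len(called):
        raise ValueError("truth and called must have the same length")
    if len(truth) == 0:
        raise ValueError("at least one metabolite is required")

    true_pos = sum(1 for t, c in zip(truth, called) if t and c)
    false_pos = sum(1 for t, c in zip(truth, called) if c and not t)
    n_called = true_pos + false_pos
    n_true = sum(1 for t in truth if t)

    return {
        # False discoveries among all discoveries; defined as 0 when nothing is called.
        "fdp": false_pos / n_called if n_called else 0.0,
        # Fraction of truly reproducible metabolites that were detected.
        "power": true_pos / n_true if n_true else float("nan"),
        # Estimated minus true proportion of reproducible signals.
        "bias": estimated_proportion - n_true / len(truth),
    }


# Toy example: 5 metabolites, 3 truly reproducible, 3 called (1 of them falsely).
summary = evaluate_calls(
    truth=[True, True, True, False, False],
    called=[True, True, False, True, False],
    estimated_proportion=0.5,
)
print(summary)
```

In a simulation study these quantities would typically be averaged over many simulated data sets; the averaging loop is omitted here.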

Highlights

Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments
We demonstrate that the (Ma)ximum (R)ank (R)eproducibility (MaRR) procedure can be adapted to high-throughput MS-Metabolomics experiments across replicate samples
Using the Maximum Rank Reproducibility (MaRR) procedure, we propose to examine the reproducibility of ranked lists from replicate experiments and assess how concordantly the metabolites are ranked across them
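To make the idea of comparing ranked lists concrete, here is a minimal Python sketch of the maximum rank statistic for one pair of replicates. It is an illustration under stated assumptions, not the marr implementation: the helper names `ranks_descending` and `max_ranks` are hypothetical, larger measured values are assumed to receive better (smaller) ranks, and ties are broken by position. The sketch covers only the per-metabolite statistic; the steps that estimate the change from reproducible to irreproducible signals and control the FDR are not shown.

```python
from typing import List, Sequence


def ranks_descending(values: Sequence[float]) -> List[int]:
    """Return ranks where rank 1 is the largest value.

    Ties are broken by position (an assumption made for this sketch).
    """
    # sorted() is stable, so tied values keep their original order.
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def max_ranks(rep1: Sequence[float], rep2: Sequence[float]) -> List[int]:
    """Maximum of the two replicate ranks for each metabolite."""
    if len(rep1) != len(rep2):
        raise ValueError("replicates must measure the same metabolites")
    r1 = ranks_descending(rep1)
    r2 = ranks_descending(rep2)
    return [max(a, b) for a, b in zip(r1, r2)]


# Toy example with four metabolites measured in two replicates.
replicate_1 = [9.1, 7.4, 3.2, 5.0]
replicate_2 = [8.8, 7.9, 4.1, 2.5]
print(max_ranks(replicate_1, replicate_2))  # prints [1, 2, 4, 4]
```

A metabolite that is ranked highly in both replicates has a small maximum rank, whereas a metabolite ranked inconsistently, or poorly in both, has a large one. This is the intuition behind using the maximum rank to separate reproducible from irreproducible signals.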

Summary

Introduction

Assessing the reproducibility of measurements is an important first step for improving the reliability of downstream analyses of high-throughput metabolomics experiments. Untargeted metabolomics experiments measure the totality of ions within a predefined mass range [3, 4]. Among the platforms employed for measuring metabolites, Gas Chromatography Mass Spectrometry (GC–MS) and Liquid Chromatography Mass Spectrometry (LC–MS) are popular due to their sensitivity and coverage of all possible ions [5]. These GC–MS and LC–MS techniques prepare a sample at a high resolution, fragment it into ions, and isolate the ions to generate spectra for the sample [6]. Many metabolomics data sets contain a large number of falsely identified metabolites, or metabolite features with incorrect integration regions and missing values, which affect the reproducibility of the study [11, 12]. In the rest of this article, we use the term metabolites to refer to small compound features resulting from a metabolomics experiment.

Objectives

Methods

Results

Discussion

Conclusion