Abstract

Missing values are known to be problematic for the analysis of gas chromatography-mass spectrometry (GC-MS) metabolomics data. Typically these values account for about 10%–20% of all data and can have analytical, computational or biological origins. Currently, the best-known substitute for missing values is mean imputation. In fact, some researchers consider this step so routine that they do not even mention the replacement approach used in their metabolomics pipeline. However, the choice may have a significant influence on the output of the data analysis and might be highly sensitive to the distribution of samples between different classes. Therefore, in this study we analysed five substitutes for missing values, namely zero, mean, median, k-nearest neighbours (kNN) and random forest (RF) imputation, in terms of their influence on unsupervised and supervised learning and, thus, their impact on the biological interpretation of the final output. The comparisons are demonstrated both visually and computationally (classification rate). The results show that the choice of replacement method may have a considerable effect on classification accuracy; a poor choice may negatively influence the biomarkers selected for early disease diagnosis or the identification of cancer-related metabolites. For the GC-MS metabolomics data studied here, our findings recommend that RF be favoured over the other tested methods for imputing missing values. This approach gave excellent classification rates for both supervised methods, principal components-linear discriminant analysis (PC-LDA) (98.02%) and partial least squares-discriminant analysis (PLS-DA) (97.96%), outperforming the other imputation methods.
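To make the compared substitutes concrete, the following is a minimal sketch of zero, mean, median and kNN imputation on a small samples-by-metabolites matrix. It is not the authors' implementation: the function names, the use of `None` as the missing-value marker, the distance definition and the toy numbers are all assumptions made for illustration. RF imputation is left out because it needs a tree-ensemble library beyond the standard library.

```python
import math
import statistics


def impute_constant(matrix, how):
    """Fill missing cells (None) column-wise with zero, the column mean or the column median."""
    if how not in ("zero", "mean", "median"):
        raise ValueError(f"unknown strategy: {how}")
    fills = []
    for j in range(len(matrix[0])):
        observed = [row[j] for row in matrix if row[j] is not None]
        if how == "zero" or not observed:
            # Assumption: a column with no observed value falls back to zero.
            fills.append(0.0)
        elif how == "mean":
            fills.append(statistics.fmean(observed))
        else:
            fills.append(statistics.median(observed))
    return [[fills[j] if v is None else v for j, v in enumerate(row)] for row in matrix]


def knn_impute(matrix, k=2):
    """Fill each missing cell with the mean of that feature in the k nearest samples.

    Distance is the root mean squared difference over the features that both
    samples have observed (an illustrative choice, not taken from the paper).
    Cells with no usable neighbour are left as None.
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    result = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples that observed feature j.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(matrix)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in donors[:k] if d < math.inf]
            if nearest:
                result[i][j] = statistics.fmean(nearest)
    return result


# Hypothetical toy data: three similar samples and one from a very different class.
data = [
    [1.0, 2.0, 3.0],
    [1.0, None, 3.0],
    [10.0, 20.0, 30.0],
    [1.2, 2.2, 3.1],
]

for name in ("zero", "mean", "median"):
    print(name, impute_constant(data, name)[1][1])
print("kNN", knn_impute(data, k=2)[1][1])
```

In this toy matrix the column mean is pulled towards the single high-abundance sample, whereas the kNN estimate stays close to the values of the most similar samples, which illustrates why a global substitute can be sensitive to how samples are distributed between classes.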

Highlights

  • Metabolomics is the study of all metabolites in a biological system under a given set of conditions [1]. The classical technologies for the analysis of metabolites include gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), capillary electrophoresis-mass spectrometry (CE-MS), nuclear magnetic resonance (NMR) spectroscopy, Fourier transform-infrared (FT-IR) spectroscopy, and many more [2]

  • Cells were divided into three groups for 24 h: one group was placed in an incubator with 95% air and 5% CO2; one group was placed in a hypoxybox with 1% O2 and 5% CO2, balanced with N2; and one group was placed in an anoxic chamber (Bactron anaerobic chamber, Sheldon Manufacturing, Cornelius, OR, USA) where 5% CO2, 5% H2 and 90% N2 (BOC, Manchester, UK) were passed over a palladium catalyst to remove any remaining oxygen

  • As discussed above, to compare the five different missing value substitutes we used data generated from the metabolic profiling of MDA-MB-231 breast cancer cells cultured at three oxygen levels (normoxia, hypoxia and anoxia) and treated with 0.1 or 1 μM doxorubicin, along with a control where no drug was used

Summary

Introduction

The classical technologies for the analysis of metabolites (i.e., chemical entities) include gas chromatography-mass spectrometry (GC-MS), liquid chromatography-mass spectrometry (LC-MS), capillary electrophoresis-mass spectrometry (CE-MS), nuclear magnetic resonance (NMR) spectroscopy, Fourier transform-infrared (FT-IR) spectroscopy, and many more [2]. These methods have been widely applied in countless metabolomics research studies. Not all of these methods generate a complete data set. For instance, GC-MS and LC-MS analyses employ chromatographic separation prior to MS and require a complex deconvolution step to transform the resulting 3D data matrices into lists of annotated features (metabolites) with (relative) abundances. The result of this process is often a matrix that is not fully populated, and these missing values present a major problem in data processing [6].
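As a small illustration of what a matrix that is not fully populated looks like in practice, the sketch below counts missing cells overall and per metabolite. The helper names, the `None` marker and the example abundances are hypothetical and are not taken from the study.

```python
def missing_fraction(matrix):
    """Fraction of all cells in the samples-by-metabolites matrix that are missing (None)."""
    total = sum(len(row) for row in matrix)
    missing = sum(value is None for row in matrix for value in row)
    return missing / total


def missing_per_feature(matrix):
    """Fraction of samples in which each metabolite (column) is missing."""
    n_samples = len(matrix)
    return [
        sum(row[j] is None for row in matrix) / n_samples
        for j in range(len(matrix[0]))
    ]


# Hypothetical deconvolved peak table: 4 samples x 5 metabolites.
peaks = [
    [5.1, None, 0.8, 2.0, 1.1],
    [4.9, 3.2, None, 2.1, 1.0],
    [5.3, 3.0, 0.7, 2.2, 1.2],
    [5.0, None, 0.9, 1.9, 1.3],
]

print(missing_fraction(peaks))     # 3 of 20 cells are missing
print(missing_per_feature(peaks))  # missingness concentrated in two metabolites
```

A per-metabolite view like this is often a useful first check, since a feature that is missing in most samples may call for different handling than one with a few sporadic gaps.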
