Abstract

Background

LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. However, non-targeted metabolite profiling approaches in particular generate vast arrays of data that are prone to aberrations such as missing values. Whatever the reason for the missing values, a coherent and complete data matrix is a prerequisite for accurate and reliable statistical analysis. Therefore, there is a need for proper imputation strategies that account for the missingness and reduce the bias in the statistical analysis.

Results

We evaluated nine imputation methods at four different percentages of missing values of different origin. The performance of each imputation method was assessed by the Normalized Root Mean Squared Error (NRMSE). Random forest (RF) had the lowest NRMSE in the estimation of values that were Missing at Random (MAR) and Missing Completely at Random (MCAR). For values that were Missing Not at Random (MNAR), the left-truncated data were best imputed with minimum value imputation. We also tested the imputation methods on datasets containing missing data of various origin, and RF was the most accurate method in all cases. The results were obtained by repeating the evaluation process 100 times on metabolomics datasets in which missing values were introduced to represent absent data of different origin.

Conclusion

The type and rate of missingness affect the performance and suitability of imputation methods. The RF-based imputation method performs best in most of the tested scenarios, including combinations of different types and rates of missingness. Therefore, we recommend random forest-based imputation for missing metabolomics data, especially in situations where the types of missingness are not known in advance.
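As an illustration of the evaluation metric named in the Results, the following is a minimal sketch of an NRMSE calculation. The normalization used here (RMSE divided by the standard deviation of the true values) is one common convention and is an assumption; the text above does not state the exact formula. The function name `nrmse` and the example values are hypothetical.

```python
import math

def nrmse(true_values, imputed_values):
    """RMSE between true and imputed values, divided by the standard
    deviation of the true values (assumed normalization)."""
    n = len(true_values)
    mse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n
    mean_true = sum(true_values) / n
    var_true = sum((t - mean_true) ** 2 for t in true_values) / n
    return math.sqrt(mse / var_true)

# Hypothetical abundances that were masked, and two candidate imputations.
truth = [2.0, 4.0, 6.0, 8.0]
perfect = [2.0, 4.0, 6.0, 8.0]    # imputation that recovers every value
mean_fill = [5.0, 5.0, 5.0, 5.0]  # imputation with the mean of the true values

print(nrmse(truth, perfect))    # 0.0
print(nrmse(truth, mean_fill))  # 1.0
```

Under this normalization, a value of 0 means the masked entries were recovered exactly, and a value of 1 means the imputation did no better than filling in the mean of the true values.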

Highlights

  • Liquid chromatography coupled to mass spectrometry (LC-MS) makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis

  • To test the performance of the nine imputation methods, we generated sub-datasets from 12 LC-MS metabolomics datasets (Fig. 1) and simulated missing values in them according to seven missingness mechanisms: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR), MCAR-MAR, MCAR-MNAR, MAR-MNAR, and MCAR-MAR-MNAR, at four proportions of missing values (5, 10, 20 and 30%)

  • To avoid potentially biased comparisons, we tuned the parameter settings of the K-Nearest Neighbors (KNN) method so that it was evaluated at its best performance
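The simulation of missing values described in the highlights can be sketched for two of the mechanisms. This is a minimal illustration, not the authors' procedure: it masks a single synthetic feature rather than a full data matrix, treats MNAR as left truncation (the lowest values are removed), and omits MAR, which requires dependence on other observed variables. All function names and the log-normal test data are hypothetical.

```python
import random

def mask_mcar(values, proportion, rng):
    """Set a uniformly random `proportion` of entries to None (MCAR)."""
    n_missing = round(len(values) * proportion)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]

def mask_mnar_left(values, proportion):
    """Set the lowest `proportion` of entries to None (left truncation)."""
    n_missing = round(len(values) * proportion)
    order = sorted(range(len(values)), key=lambda i: values[i])
    lowest_idx = set(order[:n_missing])
    return [None if i in lowest_idx else v for i, v in enumerate(values)]

def impute_min(values):
    """Replace missing entries with the smallest observed value."""
    observed_min = min(v for v in values if v is not None)
    return [observed_min if v is None else v for v in values]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
feature = [rng.lognormvariate(5, 1) for _ in range(100)]  # synthetic abundances

for p in (0.05, 0.10, 0.20, 0.30):
    mcar = mask_mcar(feature, p, rng)
    mnar = mask_mnar_left(feature, p)
    # Report the proportion and the number of masked entries per mechanism.
    print(p, mcar.count(None), mnar.count(None))
```

Mixed mechanisms such as MCAR-MNAR could be approximated by applying one mask after the other, each with a share of the target proportion, although how the shares were divided is not stated here.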


Summary

Introduction

LC-MS technology makes it possible to measure the relative abundance of numerous molecular features of a sample in a single analysis. Whatever the reason for the missing values in the data, a coherent and complete data matrix is a prerequisite for accurate and reliable statistical analysis. Raw data processing involves various steps, including peak detection, peak alignment, adduct/neutral loss detection, baseline correction and noise reduction. It is one of the most challenging computational processes in a metabolomics experiment and is prone to errors [8]. If the missing values are not treated properly, the related statistical analysis and the interpretation of the metabolomics data will be biased.
