Abstract

Missing values are widespread in mass spectrometry (MS)-based metabolomics data. Various methods have been applied to handle them, but the choice of method can significantly affect subsequent data analyses. Missing values typically fall into three types: missing not at random (MNAR), missing at random (MAR), and missing completely at random (MCAR). Our study comprehensively compared eight imputation methods (zero, half minimum (HM), mean, median, random forest (RF), singular value decomposition (SVD), k-nearest neighbors (kNN), and quantile regression imputation of left-censored data (QRILC)) for different types of missing values using four metabolomics datasets. Normalized root mean squared error (NRMSE) and the NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy. Principal component analysis (PCA)/partial least squares (PLS)-Procrustes analysis was used to evaluate the overall sample distribution. Student’s t-test followed by correlation analysis was conducted to evaluate the effects on univariate statistics. Our findings demonstrated that RF performed best for MCAR/MAR and that QRILC was the favored method for left-censored MNAR. Finally, we proposed a comprehensive strategy and developed a publicly accessible web tool for missing value imputation in metabolomics (https://metabolomics.cc.hawaii.edu/software/MetImp/).

Highlights

  • Metabolomics is the study of systematic identification and/or quantification of wide ranges of small molecule metabolites in bio-samples

  • We found that the k-nearest neighbors (kNN) imputation method produced an even larger normalized root mean squared error (NRMSE) than two determined-value imputation methods once the proportion of missing values reached certain levels

  • Results of the sum of ranks (SOR) (Fig. 2a,b) showed that three imputation methods, random forest (RF), singular value decomposition (SVD), and kNN, all performed poorly on missing not at random (MNAR) values, as did zero imputation, which has been commonly used in metabolomics data analysis
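
As a rough illustration of how an NRMSE-based sum of ranks could be computed, the sketch below ranks the methods within each variable by NRMSE (rank 1 = lowest error) and sums the ranks across variables. This is one plausible reading of SOR, not necessarily the exact procedure used in the study. The function name `sum_of_ranks`, the method labels, and the numbers are all hypothetical.

```python
def sum_of_ranks(nrmse_by_method):
    """Sum per-variable ranks for each method (lower total = better).

    nrmse_by_method maps a method name to a list of NRMSE values,
    one per variable, in the same variable order for every method.
    Ties are broken by dictionary order (a simplifying assumption).
    """
    methods = list(nrmse_by_method)
    n_vars = len(nrmse_by_method[methods[0]])
    totals = {m: 0 for m in methods}
    for j in range(n_vars):
        # Rank the methods on variable j: smallest NRMSE gets rank 1.
        ordered = sorted(methods, key=lambda m: nrmse_by_method[m][j])
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return totals


# Made-up NRMSE values for three placeholder methods on two variables.
example = {
    "method_a": [0.1, 0.5],
    "method_b": [0.2, 0.3],
    "method_c": [0.9, 0.9],
}
totals = sum_of_ranks(example)
```

Summing ranks rather than raw errors keeps a single variable with a very large error from dominating the comparison.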


Summary

Introduction

Metabolomics is the systematic identification and/or quantification of a wide range of small-molecule metabolites in biological samples (cells, tissues, biological fluids, etc.). Guida et al. investigated different data processing methods (including normalization, imputation, transformation, and scaling) in non-targeted metabolomics. They concluded that RF performed best in PCA and recommended kNN for partial least squares-discriminant analysis (PLS-DA)[26]. We systematically measured the performance of the imputation methods in three ways: (1) normalized root mean squared error (NRMSE) and the NRMSE-based sum of ranks (SOR) were applied to evaluate imputation accuracy for MCAR/MAR and MNAR, respectively; (2) the principal component analysis (PCA)/partial least squares (PLS)-Procrustes sum of squared errors was used to evaluate the overall sample distribution; and (3) Student’s t-test followed by Pearson correlation analysis was conducted to evaluate the effect of imputation on univariate statistical analysis. We evaluated the types of missing values in two real metabolomics datasets and found that MCAR/MAR occurred widely in GC/MS profiling data, whereas MNAR occurred in LC/MS targeted data. Assuming that variables with a large proportion of missing values are removed beforehand, we proposed a comprehensive strategy and a publicly accessible web tool for dealing with missing values in metabolomics studies (https://metabolomics.cc.hawaii.edu/software/MetImp/).
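
To make the first evaluation step concrete, here is a minimal sketch of an NRMSE check for the four constant-fill methods (zero, half minimum, mean, median) on a single variable. It assumes one common definition of NRMSE, the root of the mean squared error over the hidden entries divided by the variance of their true values. The study's exact normalization may differ. The helper names (`nrmse`, `mcar_indices`, `score_constant_fills`) and the toy data are hypothetical.

```python
import math
import random
from statistics import mean, median, pvariance


def nrmse(true_vals, imputed_vals):
    """sqrt(MSE / variance of the true values) over the hidden entries."""
    mse = mean((t - i) ** 2 for t, i in zip(true_vals, imputed_vals))
    return math.sqrt(mse / pvariance(true_vals))


def mcar_indices(n, proportion, seed=0):
    """Choose positions to hide uniformly at random (an MCAR-style mask)."""
    k = max(2, round(n * proportion))  # need >= 2 values for a variance
    return sorted(random.Random(seed).sample(range(n), k))


def score_constant_fills(column, hidden):
    """Hide the given positions, impute with constants, return NRMSE per method."""
    hidden_set = set(hidden)
    observed = [v for i, v in enumerate(column) if i not in hidden_set]
    truth = [column[i] for i in hidden]
    fills = {
        "zero": 0.0,
        "half_min": min(observed) / 2,  # half of the observed minimum
        "mean": mean(observed),
        "median": median(observed),
    }
    return {name: nrmse(truth, [fill] * len(truth)) for name, fill in fills.items()}


# Toy variable with made-up intensities; positions 1, 4, 7 are hidden.
column = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
scores = score_constant_fills(column, hidden=[1, 4, 7])
```

In a simulation, `mcar_indices` could supply the hidden positions. Because the mask is generated from complete data, the true values are known and the error can be measured directly.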

