Abstract

Missing values are a genuine issue in label-free quantitative proteomics. Recent works have surveyed the statistical methods available for imputation, compared them on real or simulated data sets, and recommended a list of missing value imputation methods for proteomics applications. Although insightful, these comparisons do not account for two important facts: (i) depending on the proteomics data set, the missingness mechanism may be of different natures and (ii) each imputation method is devoted to a specific type of missingness mechanism. As a result, we believe that the question at stake is not to find the most accurate imputation method in general but instead the most appropriate one. We describe a series of comparisons that support our views: for instance, we show that a supposedly "under-performing" method (i.e., giving baseline average results), if applied at the "appropriate" time in the data-processing pipeline (before or after peptide aggregation) on a data set with the "appropriate" nature of missing values, can outperform a blindly applied, supposedly "better-performing" method (i.e., the reference method from the state-of-the-art). This leads us to formulate a few practical guidelines regarding the choice and the application of an imputation method in a proteomics context.

Highlights

  • The high rate of missing values in label-free quantitative proteomics is a major concern.[1]

  • In the absence of knowledge about the nature(s) of missing values in a particular quantitative proteomics data set, it makes sense to rely on an MCAR/MAR (missing completely at random / missing at random) imputation method.

  • This is supported by numerous experiments, including ours as well as those from ref 13, and by theoretical arguments: by definition, missing values that should be imputed by small intensities can show up in an MCAR context, while, on the contrary, a method devoted to left-censored missing values will systematically perform poorly on other types of missing values.
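The theoretical argument in the last highlight can be illustrated with a toy simulation. The sketch below is not the paper's protocol: the Gaussian log-intensities, the 20% missing rate, and the two deliberately crude imputers (mean of observed values as an MCAR/MAR-style baseline, minimum observed value as a left-censored-style baseline) are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random
import statistics


def simulate(n=2000, seed=0):
    """Draw hypothetical log-intensities for one analyte across n samples."""
    rng = random.Random(seed)
    return [rng.gauss(25.0, 2.0) for _ in range(n)]


def mask_mcar(values, rate, seed=1):
    """MCAR: hide values uniformly at random, regardless of intensity."""
    rng = random.Random(seed)
    return [rng.random() < rate for _ in values]


def mask_left_censored(values, rate):
    """Left-censored MNAR: hide the lowest values (a crude detection limit)."""
    limit = sorted(values)[int(rate * len(values))]
    return [v < limit for v in values]


def impute_mean(values, missing):
    """MCAR/MAR-style baseline: fill with the mean of the observed values."""
    return statistics.fmean(v for v, m in zip(values, missing) if not m)


def impute_min(values, missing):
    """Left-censored-style baseline: fill with the smallest observed value."""
    return min(v for v, m in zip(values, missing) if not m)


def rmse(values, missing, fill):
    """Root-mean-square error of a constant fill over the hidden values."""
    errors = [(v - fill) ** 2 for v, m in zip(values, missing) if m]
    return math.sqrt(statistics.fmean(errors))


truth = simulate()
masks = {
    "MCAR": mask_mcar(truth, 0.2),
    "left-censored": mask_left_censored(truth, 0.2),
}
results = {}
for name, mask in masks.items():
    results[name] = {
        "mean": rmse(truth, mask, impute_mean(truth, mask)),
        "min": rmse(truth, mask, impute_min(truth, mask)),
    }
    print(name, results[name])
```

In this toy setting each imputer does better on the mechanism it is devoted to, and the left-censored-style imputer loses more on MCAR data than the mean imputer loses on left-censored data. Real methods and real data sets are far richer, so this only illustrates the direction of the argument.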


Summary

INTRODUCTION

The high rate of missing values in label-free quantitative proteomics is a major concern.[1] In mass-spectrometry-based analysis, chemical species whose abundances are close enough to the limit of detection of the instrument record a higher rate of missing values. This is why MNAR-devoted imputation methods used in proteomics focus on left-censored data. MNAR (including left-censored) mechanisms are discipline-specific, so that a precise understanding of the mechanism underlying the data generation is mandatory. This is why, in the comparisons depicted in ref 13, among the nine methods, only three MNAR-devoted approaches were considered, among which two are based on the same principle. Numerous conclusions and recommendations can be drawn from these experiments; beyond them, our work pinpoints the fact that most of the conclusions regarding imputation methods cannot be claimed to hold in general. On the contrary, they should be contextualized according to each data set, the proportion of missing values, and their nature.
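The link between the limit of detection and left-censoring can be made concrete with a minimal sketch. It assumes a hypothetical hard detection threshold (`LOD`) and Gaussian measurement noise on a log-intensity scale; neither assumption comes from the paper, and real instruments do not censor this sharply.

```python
import random

LOD = 20.0  # hypothetical limit of detection, on a log-intensity scale
rng = random.Random(0)


def missing_rate(mean_abundance, n_samples=5000, noise_sd=1.0):
    """Fraction of simulated measurements that fall below the LOD."""
    below = sum(
        rng.gauss(mean_abundance, noise_sd) < LOD for _ in range(n_samples)
    )
    return below / n_samples


# Species whose mean abundance gets progressively closer to the LOD.
rates = {mean: missing_rate(mean) for mean in (24.0, 22.0, 21.0, 20.0)}
for mean, rate in rates.items():
    print(f"mean abundance {mean:.1f}: missing rate {rate:.3f}")
```

Under these assumptions the missing rate grows as the mean abundance approaches the threshold, which is the pattern that left-censored imputation methods are built around.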
