Abstract

Protein quantification by label-free shotgun proteomics experiments is plagued by a multitude of error sources. Typical pipelines for identifying differential proteins use intermediate filters to control the error rate. However, they often ignore certain error sources and, moreover, regard filtered lists as completely correct in subsequent steps. These two indiscretions can easily lead to a loss of control of the false discovery rate (FDR). We propose a probabilistic graphical model, Triqler, that propagates error information through all steps, employing distributions in favor of point estimates, most notably for missing value imputation. The model outputs posterior probabilities for fold changes between treatment groups, highlighting uncertainty rather than hiding it. We analyzed 3 engineered data sets and achieved FDR control and high sensitivity, even for truly absent proteins. In a bladder cancer clinical data set we discovered 35 proteins at 5% FDR, whereas the original study discovered 1 and MaxQuant/Perseus 4 proteins at this threshold. Compellingly, these 35 proteins showed enrichment for functional annotation terms, whereas the top ranked proteins reported by MaxQuant/Perseus showed no enrichment. The model executes in minutes and is freely available at https://pypi.org/project/triqler/.

Highlights

  • Shotgun proteomics has in recent years made rapid advances from being a tool for large-scale identification to one that includes accurate quantification of proteins [23]. Software packages have been developed to facilitate the quantitative interpretation of MS data; for a review, see e.g. [21]

  • Error rate control in protein quantification has mostly been limited to setting intermediate false discovery rate (FDR) thresholds for the identifications or other heuristic cutoffs, such as requiring at least a certain number of peptides [8, 3] or a certain correlation between peptide quantifications [41, 40]

  • We plotted the posterior distributions of the log2 fold changes between each pair of treatment groups obtained by Triqler and compared these to the Gaussian distributions obtained from the triplicate measurements in the naive pipeline
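To make the comparison in the last highlight concrete, here is a minimal sketch of the naive pipeline's side of it: summarizing triplicate log2 fold changes as a single fitted Gaussian. The triplicate values below are invented for illustration; Triqler, by contrast, reports a full posterior distribution rather than a two-parameter fit.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

# Assumed triplicate log2 fold-change measurements for one protein
# (hypothetical values, not from the paper's data sets).
log2_fc = [1.1, 0.9, 1.3]

# The naive pipeline collapses these into a Gaussian N(mu, sigma^2).
mu = mean(log2_fc)
sigma = stdev(log2_fc)  # sample standard deviation (n - 1 denominator)

def gaussian_pdf(x, mu, sigma):
    """Density of the fitted Gaussian at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

print(mu, sigma)  # point estimates; all shape information beyond these is lost
```

With only three replicates, sigma is itself highly uncertain, which is one reason a point-estimate Gaussian can understate the true uncertainty that a posterior distribution makes explicit.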


Introduction

Shotgun proteomics has in recent years made rapid advances from being a tool for large-scale identification to one that includes accurate quantification of proteins [23]. Software packages have been developed to facilitate the quantitative interpretation of MS data; for a review, see e.g. [21]. Error rate control in protein quantification has mostly been limited to setting intermediate false discovery rate (FDR) thresholds for the identifications or other heuristic cutoffs, such as requiring at least a certain number of peptides [8, 3] or a certain correlation between peptide quantifications [41, 40]. This gives no direct control of the errors in the reported lists of differential proteins and discards potentially valuable information for proteins that just missed one of the thresholds. Several methods have applied Bayesian statistics to parts of the quantification pipeline, but an integrated model for protein quantification is still lacking.
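The intermediate filters described above can be illustrated with a short sketch. The thresholds and peptide records here are hypothetical, chosen only to show the hard-cutoff behavior: a protein whose evidence just misses either threshold is discarded outright, and no trace of its near-miss status survives to later steps.

```python
# Hypothetical naive filtering pipeline: keep a protein only if it has at
# least MIN_PEPTIDES peptides passing a peptide-level FDR cutoff.
# Both threshold values are assumptions for illustration.
FDR_THRESHOLD = 0.01
MIN_PEPTIDES = 2

def passes_filters(peptides):
    """peptides: list of (fdr, quant) tuples for one protein.

    Returns True if enough peptides pass the FDR cutoff. Note the binary
    outcome: a peptide at 1.2% FDR contributes exactly as much as one at
    99% FDR, i.e. nothing.
    """
    confident = [quant for fdr, quant in peptides if fdr <= FDR_THRESHOLD]
    return len(confident) >= MIN_PEPTIDES

protein_a = [(0.005, 1.8), (0.008, 2.1)]  # both peptides confident
protein_b = [(0.005, 1.9), (0.012, 2.0)]  # one peptide just misses the cutoff

print(passes_filters(protein_a))  # True
print(passes_filters(protein_b))  # False
```

A model that propagates error probabilities instead of applying such cutoffs can still down-weight the borderline peptide of `protein_b` without discarding the protein entirely.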
