Abstract

Comprehensive characterization of a proteome is a fundamental goal in proteomics. To maximize proteome coverage for a complex protein mixture, i.e., to identify as many proteins as possible, several fractionation experiments are typically performed and the individual fractions are subjected to mass spectrometric analysis. The resulting data are integrated into large and heterogeneous datasets. Proteome coverage prediction is the task of extrapolating the number of proteins that future measurements will discover, conditioned on the sequence of measurements already performed. Predicting proteome coverage at an early stage enables experimentalists to design and plan efficient proteomics studies. To date, no method reliably predicts proteome coverage from integrated datasets. We present a generalized hierarchical Pitman-Yor process model that explicitly captures the redundancy within integrated datasets. We assess the accuracy of our approach by applying it to an integrated proteomics dataset for the bacterium L. interrogans. The proposed procedure outperforms ad hoc extrapolation methods and prediction methods designed for non-integrated datasets. Furthermore, we estimate the maximally achievable proteome coverage for the experimental setup underlying the L. interrogans dataset. We discuss the implications of our results for determining rational stop criteria and their influence on the design of efficient and reliable proteomics studies.
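
To make the idea of a discovery curve concrete, the following is a minimal sketch that assumes a plain single-level Pitman-Yor process (in its Chinese restaurant formulation), not the generalized hierarchical model presented in this work. The function name `simulate_discoveries` and the parameter values are hypothetical and chosen only for illustration. Each simulated observation either re-identifies a protein that was already seen or discovers a new one, and the running count of distinct proteins is the coverage curve that a prediction method would extrapolate.

```python
import random


def simulate_discoveries(n_observations, discount, concentration, seed=0):
    """Simulate a discovery curve under a single-level Pitman-Yor process.

    Illustrative assumption only: this is NOT the hierarchical model of the
    paper. Requires 0 <= discount < 1 and concentration > 0.
    Returns a list whose i-th entry is the number of distinct proteins
    seen after i + 1 observations.
    """
    if not (0.0 <= discount < 1.0) or concentration <= 0.0:
        raise ValueError("need 0 <= discount < 1 and concentration > 0")

    rng = random.Random(seed)
    counts = []  # counts[k] = how often protein k has been observed so far
    curve = []

    for n in range(n_observations):
        k = len(counts)
        # Probability that the next observation is a new protein.
        p_new = (concentration + k * discount) / (n + concentration)
        if rng.random() < p_new:
            counts.append(1)
        else:
            # Re-observe protein j with probability proportional to
            # counts[j] - discount.
            weights = [c - discount for c in counts]
            j = rng.choices(range(k), weights=weights)[0]
            counts[j] += 1
        curve.append(len(counts))

    return curve


if __name__ == "__main__":
    curve = simulate_discoveries(500, discount=0.5, concentration=10.0)
    # Print the number of distinct proteins at a few checkpoints.
    for n in (50, 100, 250, 500):
        print(n, curve[n - 1])
```

In this sketch a larger discount yields a heavier-tailed curve in which new discoveries keep arriving for longer, which is the kind of saturation behaviour that a stop criterion would have to account for.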
