Abstract
Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
Highlights
Shotgun proteomics relies on liquid chromatography and tandem mass spectrometry to identify proteins in complex biological mixtures
The idea is that the decoy peptide-spectrum matches (PSMs) make a good model of the incorrect target matches, so that the error rates can be estimated [18]
The three subsets of PSMs are merged, and the overall error rates are estimated by target-decoy analysis on all PSMs
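The splitting and merging mentioned in the last highlight can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `(features, is_decoy)` representation of a PSM, the random fold assignment, and the toy centroid-difference learner are all assumptions made for the example. The only point it shows is that each PSM is scored by a model trained on the other two subsets, after which the three scored subsets are merged into one list.

```python
import random


def cross_validated_scores(psms, train, n_folds=3, seed=0):
    """Score every PSM with a model that never saw it during training.

    psms:  list of (features, is_decoy) tuples (hypothetical representation).
    train: hypothetical callable that takes a list of PSMs and returns
           a scoring function mapping features to a float.
    Returns a list of (score, is_decoy) in the original PSM order.
    """
    indices = list(range(len(psms)))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds disjoint subsets.
    folds = [indices[i::n_folds] for i in range(n_folds)]

    merged = [None] * len(psms)
    for held_out in folds:
        held_out_set = set(held_out)
        training = [psms[i] for i in indices if i not in held_out_set]
        score = train(training)
        # Only the held-out subset is scored with this model.
        for i in held_out:
            features, is_decoy = psms[i]
            merged[i] = (score(features), is_decoy)
    return merged


def train_centroid_difference(training):
    """Toy learner (an assumption): weights = target mean minus decoy mean."""
    targets = [f for f, is_decoy in training if not is_decoy]
    decoys = [f for f, is_decoy in training if is_decoy]
    dim = len(training[0][0])
    weights = [
        sum(f[d] for f in targets) / len(targets)
        - sum(f[d] for f in decoys) / len(decoys)
        for d in range(dim)
    ]
    return lambda features: sum(w * x for w, x in zip(weights, features))


# Made-up PSMs: six targets and six decoys with two features each.
psms = [((float(i), 1.0), False) for i in range(4, 10)]
psms += [((float(i), 1.0), True) for i in range(0, 6)]

merged = cross_validated_scores(psms, train_centroid_difference)
print(len(merged), sum(is_decoy for _, is_decoy in merged))
```

The merged list can then be passed to a target-decoy error estimate over all PSMs. Note that this sketch merges raw scores directly; scores produced by separately trained models are not necessarily on the same scale, and any normalization step needed before merging is not shown here.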
Summary
Shotgun proteomics relies on liquid chromatography and tandem mass spectrometry to identify proteins in complex biological mixtures. Because there are many error sources associated both with the mass spectrometer and the matching procedures, correct and incorrect PSMs cannot be completely discriminated using raw scores. For this reason, an important step in the analysis is to estimate the error rate associated with a given score threshold. The target-decoy approach has been used to increase the score discrimination between correct and incorrect PSMs using semi-supervised machine learning [12,13,14,15,16,17]. This increased discrimination is highly valuable, because it typically results in a considerably higher number of confident peptide identifications. The effect of the validation is demonstrated using an example based on simulated data.
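To make the idea of an error rate at a given score threshold concrete, the following is a minimal sketch of a target-decoy estimate. It assumes separate target and decoy searches against databases of equal size, so that the number of decoy PSMs above the threshold approximates the number of incorrect target PSMs above it. The function name and the score values are made up for illustration and do not come from the paper.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the false discovery rate among target PSMs at a threshold.

    Assumption: decoy PSMs scoring at or above the threshold are as
    numerous as incorrect target PSMs scoring at or above it.
    """
    n_targets = sum(s >= threshold for s in target_scores)
    n_decoys = sum(s >= threshold for s in decoy_scores)
    if n_targets == 0:
        return 0.0  # nothing accepted, so no errors to report
    # Cap at 1.0, since a rate above 100% is not meaningful.
    return min(1.0, n_decoys / n_targets)


# Illustrative, made-up PSM scores (higher is better).
targets = [9.1, 8.4, 7.7, 6.2, 5.0, 4.1, 3.3, 2.8]
decoys = [5.5, 3.9, 3.1, 2.5, 2.0, 1.7, 1.2, 0.9]

# 1 decoy and 5 targets score at least 5.0, giving 1 / 5 = 0.2.
print(estimate_fdr(targets, decoys, threshold=5.0))
```

Lowering the threshold accepts more target PSMs but also admits more decoys, so the estimated error rate generally rises.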