Abstract
Background
Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Both approaches have shortcomings, so an integrative approach has been developed. This approach first infers partial peptide sequences, known as tags, directly from tandem spectra through de novo sequencing, and then uses these tags in a database search to see whether a close peptide match can be found. However, current implementations of this integrative approach have several limitations. First, only simplistic de novo sequencing is applied and only very short sequence tags are used. Second, most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. Third, in these methods the integrated de novo sequencing contributes little to the scoring model, which is still largely based on database searching.
Results
We have developed a new integrative protein identification method that integrates de novo sequencing more efficiently into database searching. Evaluated on large real datasets, our method outperforms popular identification methods.
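The difference between exact tag matching and error-tolerant tag matching can be illustrated with a minimal Python sketch. This is not the method described in this paper; it is a plain sliding-window comparison that accepts a bounded number of mismatched residues. The function names, the toy protein sequence, and the tag are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical toy protein sequence and de novo tag (illustration only).
TOY_PROTEIN = "MKWVTFISLLLLFSSAYSR"
TOY_TAG = "TFLSL"  # differs from the true substring "TFISL" by one residue


def mismatches(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_tag_hits(tag: str, protein: str, max_mismatches: int = 0) -> List[Tuple[int, str, int]]:
    """Slide the tag along the protein and report every window that
    differs from the tag in at most `max_mismatches` positions.

    Returns (start index, matched window, number of mismatches) tuples.
    """
    hits = []
    n = len(tag)
    for start in range(len(protein) - n + 1):
        window = protein[start:start + n]
        d = mismatches(tag, window)
        if d <= max_mismatches:
            hits.append((start, window, d))
    return hits


exact_hits = find_tag_hits(TOY_TAG, TOY_PROTEIN, max_mismatches=0)
tolerant_hits = find_tag_hits(TOY_TAG, TOY_PROTEIN, max_mismatches=1)
print(exact_hits)     # []
print(tolerant_hits)  # [(4, 'TFISL', 1)]
```

With exact matching the erroneous tag finds nothing, while allowing a single mismatch recovers the intended location. A realistic implementation would also need to handle insertions, deletions, and mass-equivalent substitutions, which this sketch ignores.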
Highlights
Mass spectrometry-based protein identification is a very challenging task
Evaluation strategy, datasets: To evaluate the performance of our method, we use the raw spectra from two large-scale datasets as a benchmark: (1) the Aurum dataset [25] and (2) the CPTAC dataset [26] from Clinical Proteomic Technologies Assessment for Cancer
The CPTAC dataset comes from a large-scale study of the reproducibility and repeatability of the Universal Proteomics Standard Set 1 (UPS1)
Summary
Mass spectrometry-based protein identification is a very challenging task. The main identification approaches include de novo sequencing and database searching. Most integrative methods apply an algorithm similar to BLAST to search for exact sequence matches and do not accommodate sequence errors well. In these methods the integrated de novo sequencing contributes little to the scoring model, which is still largely based on database searching. Accurate identification of proteins from tandem mass spectra is a very challenging task, and existing methods can typically identify fewer than 50% of the proteins in a complex sample [1-3]. Despite having the advantage of robustness, the database search approach has several limitations. It is only effective if the proteins of interest are already known and the utilised database contains the correct protein sequences. Specifying the enzyme used in the proteolytic digestion can exclude the correct peptides from the database search space and lead to erroneous identifications [10].
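To see why an enzyme constraint can exclude a correct peptide from the search space, consider a minimal Python sketch of in silico digestion. It assumes trypsin with the commonly used rule (cleave after K or R unless the next residue is P) and no missed cleavages; the text above does not name a specific enzyme, so this choice is an assumption. The function name, the protein sequence, and the observed peptide are hypothetical.

```python
from typing import List


def tryptic_peptides(protein: str) -> List[str]:
    """Split a protein after every K or R that is not followed by P.

    Assumes fully specific cleavage with no missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing peptide that does not end in K or R.
        peptides.append(protein[start:])
    return peptides


# Hypothetical toy protein and a peptide produced by non-specific cleavage.
toy_protein = "AAKPGGRCDEKFFH"
observed_peptide = "GGRCDE"

candidates = tryptic_peptides(toy_protein)
print(candidates)                      # ['AAKPGGR', 'CDEK', 'FFH']
print(observed_peptide in toy_protein) # True: the peptide is part of the protein
print(observed_peptide in candidates)  # False: the enzyme rule removed it from the search space
```

The observed peptide is a genuine substring of the protein, yet it never appears among the candidates, so a search restricted to these candidates cannot identify it correctly.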