Abstract

Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. Traditional search engines, which match peptide sequences with tandem mass spectra to identify the samples' proteins, use protein sequence databases to suggest peptide candidates for consideration. Although the acquisition of tandem mass spectra is not biased toward well-understood protein isoforms, this computational strategy is failing to identify peptides from alternative splicing and coding SNP protein isoforms despite the acquisition of good-quality tandem mass spectra. We propose, instead, that expressed sequence tags (ESTs) be searched. Ordinarily, such a strategy would be computationally infeasible due to the size of EST sequence databases; however, we show that a sophisticated sequence database compression strategy, applied to human ESTs, reduces the sequence database size approximately 35-fold. Once compressed, our EST sequence database is comparable in size to other commonly used protein sequence databases, making routine EST searching feasible. We demonstrate that our EST sequence database enables the discovery of novel peptides in a variety of public data sets.
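To make the redundancy argument concrete, the following minimal Python sketch counts how many peptide k-mers in a small set of overlapping sequences are actually distinct. The toy sequences, the function names `kmers` and `kmer_redundancy`, and the choice of k = 5 are assumptions made purely for illustration. This is not the compression strategy of the paper, only a way to see the kind of repeated content that such a strategy can exploit.

```python
def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def kmer_redundancy(seqs, k):
    """Return (total k-mer occurrences, number of distinct k-mers)."""
    total = 0
    distinct = set()
    for s in seqs:
        for km in kmers(s, k):
            total += 1
            distinct.add(km)
    return total, len(distinct)


# Hypothetical toy data: three overlapping fragments of one peptide sequence,
# standing in for redundant translated ESTs.
toy = ["MKTAYIAKQR", "TAYIAKQRQI", "AYIAKQRQIS"]
total, distinct = kmer_redundancy(toy, 5)
print(f"{total} k-mer occurrences, {distinct} distinct")
```

On this toy input, half of the k-mer occurrences are repeats; a database that stored each distinct k-mer only once would be correspondingly smaller while still presenting the same peptide candidates to a search engine.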

Highlights

  • Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples

  • We propose to remove the computational bias imposed by the use of protein sequence databases in peptide identification (© 2007 EMBO and Nature Publishing Group)

  • We demonstrate that our compressed human expressed sequence tag (EST) peptide sequence database makes it possible to re-search publicly available tandem mass spectra from human samples, such as those in the PeptideAtlas (Desiere et al, 2006) and the Human Proteome Organization (HUPO) Plasma Proteome Project (PPP) (Omenn et al, 2005) data repositories, to look for, and find, known coding SNPs, novel coding mutations, alternative splicing isoforms, alternative translation start sites, microexons, and alternative translation frames


Summary

Introduction

Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. We demonstrate that our compressed human EST peptide sequence database makes it possible to re-search publicly available tandem mass spectra from human samples, such as those in the PeptideAtlas (Desiere et al, 2006) and the Human Proteome Organization (HUPO) Plasma Proteome Project (PPP) (Omenn et al, 2005) data repositories, to look for, and find, known coding SNPs, novel coding mutations, alternative splicing isoforms, alternative translation start sites, microexons, and alternative translation frames. Many of these novel peptides, which are missing from current protein sequence databases, straddle exon boundaries and could not have been observed by searching the six-frame translation of the human genome directly, a strategy proposed by Fermin et al (2006) for the HUPO PPP project. The optimal complete, correct (C2) sequence database construction is described in the Materials and methods section.
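The point about exon boundaries can be illustrated with a small, hedged Python sketch. It translates a toy genomic fragment in all six frames and compares the result with the translation of the corresponding spliced transcript, which is the kind of sequence an EST is derived from. The exon and intron sequences, the junction peptide, and the helper names `translate` and `six_frame` are hypothetical and chosen only for illustration; this is not the search pipeline of Fermin et al (2006) or of the present work.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna, frame=0):
    """Translate dna from the given offset, ignoring a trailing partial codon."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frame(dna):
    """Return the three forward and three reverse-complement translations."""
    rc = dna.translate(COMPLEMENT)[::-1]
    return [translate(s, f) for s in (dna, rc) for f in range(3)]


# Hypothetical toy gene: two exons separated by a 17-base intron.
exon1 = "ATGGCTAAGGAC"        # encodes M A K D
intron = "GTAAGTCCCCCCCCCAG"  # removed by splicing
exon2 = "TGGCGTGAATTC"        # encodes W R E F

genome = exon1 + intron + exon2
transcript = exon1 + exon2    # spliced sequence, as an EST would represent it

junction_peptide = "KDWR"     # straddles the exon1/exon2 boundary

found_in_transcript = junction_peptide in translate(transcript)
found_in_genome = any(junction_peptide in t for t in six_frame(genome))
print(found_in_transcript, found_in_genome)
```

In this toy case the junction-spanning peptide is present in the transcript translation but in none of the six genomic frames, because the intron both interrupts the coding sequence and shifts the reading frame between the two exons.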

Results and discussion
Molecular Systems Biology 2007
Materials and methods