Abstract

Background

Proteogenomics is a promising approach for tasks ranging from gene annotation to cancer research. Databases for proteogenomic searches are often constructed by adding peptide sequences inferred from genomic or transcriptomic evidence to reference protein sequences. Such database inflation has the potential to identify novel peptides, but it also raises concerns about the sensitivity and reliability of peptide identification. Spurious peptides included in target databases may lead to an underestimated false discovery rate (FDR). Conversely, inflation of decoy databases could decrease the sensitivity of peptide identification because of the increased number of high-scoring random hits. Although several studies have addressed these issues, widely applicable guidelines for sensitive and reliable proteogenomic search have rarely been available.

Results

To systematically evaluate the effect of database inflation in proteogenomic searches, we constructed a variety of real and simulated proteogenomic databases for yeast and human tandem mass spectrometry (MS/MS) data. Against these databases, we tested two popular database search tools with several approaches to search result validation: the target-decoy search strategy (with and without a refined scoring metric) and a mixture model-based method. The effect of filtering known and novel peptides separately was also examined. The results from real and simulated proteogenomic searches confirmed that separate filtering increases the sensitivity and reliability of proteogenomic search. However, no single method consistently identified the largest (or the smallest) number of novel peptides in real proteogenomic searches.

Conclusions

We propose using a set of search result validation methods with separate filtering for sensitive and reliable identification of peptides in proteogenomic search.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-016-3327-5) contains supplementary material, which is available to authorized users.
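The separate filtering idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the `PSM` record, the "higher score is better" convention, the simple decoys/targets FDR estimate, and the assumption that each decoy hit can be assigned to the known or novel class are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PSM:
    """Hypothetical peptide-spectrum match record (not from the paper)."""
    score: float     # assumption: a higher score means a better match
    is_decoy: bool   # True if the match is to a decoy sequence
    is_novel: bool   # True if the sequence is not in the reference proteins


def score_threshold(psms, fdr=0.01):
    """Return the lowest score cutoff with decoys/targets <= fdr, or None."""
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    targets = decoys = 0
    best = None
    for i, p in enumerate(ranked):
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        # Only evaluate a cutoff once every PSM sharing this score is counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1].score != p.score
        if last_of_tie and targets > 0 and decoys / targets <= fdr:
            best = p.score
    return best


def accepted_targets(psms, fdr=0.01):
    """Target PSMs that pass the estimated-FDR cutoff for this set."""
    cutoff = score_threshold(psms, fdr)
    if cutoff is None:
        return []
    return [p for p in psms if not p.is_decoy and p.score >= cutoff]


def separate_filter(psms, fdr=0.01):
    """Filter known and novel PSMs independently, each at the same FDR."""
    known = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return accepted_targets(known, fdr), accepted_targets(novel, fdr)
```

Calling `accepted_targets` on the pooled list gives the combined-filtering result for comparison. In the sketch, pooling lets the many high-scoring known peptides dilute the decoy rate applied to the novel class, which is the concern that motivates filtering the two classes separately.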

Highlights

  • Comparison between simulated and real proteogenomic search results: To test the effectiveness of simulation experiments, peptide identification results were compared between the following simulated and real proteogenomic database pairs of similar sizes: ‘1T2Dy + 3Dy’ (9,186,837 + 9,186,837 amino acids (AA)) and ‘six-frame translation database for yeast (6FTTy) + six-frame translation decoy database for yeast (6FTDy)’ (9,654,965 + 9,654,965 AA) for yeast, and ‘1T2Dh + 3Dh’ (107,568,099 + 107,568,099 AA) and

  • We examined the proportion of peptides from reference protein sequences among the peptide identification results, because we hypothesized that a substantial number of the peptides added to reference protein sequences for proteogenomic search would not be real targets but random sequences
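The proportion described in the last highlight can be computed along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, a peptide counts as a reference peptide when it is an exact substring of some reference protein, and isoleucine/leucine ambiguity and modifications are ignored.

```python
def is_reference_peptide(peptide, reference_proteins):
    """True if the peptide occurs as an exact substring of any reference protein."""
    return any(peptide in protein for protein in reference_proteins)


def reference_proportion(identified_peptides, reference_proteins):
    """Fraction of unique identified peptides that map to reference proteins."""
    unique = set(identified_peptides)
    if not unique:
        return 0.0
    hits = sum(1 for pep in unique if is_reference_peptide(pep, reference_proteins))
    return hits / len(unique)
```

Deduplicating first means a peptide matched by many spectra is counted once; counting per spectrum instead would be an equally reasonable choice, depending on the question being asked.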

Summary

Introduction

Proteogenomic search [1], i.e., searching tandem mass spectrometry (MS/MS) spectra against an integrated database consisting of reference proteins as well as protein sequences derived from genomic or transcriptomic evidence or hypotheses, is useful for identifying novel or sample-specific peptides. It has been applied to tasks such as discovering novel protein-coding regions [2, 10, 11], validating gene annotation [12, 13, 14, 15], and studying disease mechanisms for personalized diagnosis and treatment [16, 17, 18]. The increased size of proteogenomic databases demands more computational resources, resulting in longer analysis times than conventional proteomic database searches.
