PepExplorer: A Similarity-driven Tool for Analyzing de Novo Sequencing Results

Felipe V Leprevost,Richard H Valente,Diogo B Lima,Jonas Perales,Rafael Melani,John R Yates,Valmir C Barbosa,Magno Junqueira,Paulo C Carvalho

doi:10.1074/mcp.m113.037002

Abstract

Peptide spectrum matching (PSM) is the current gold standard for protein identification via mass-spectrometry-based proteomics. PSM compares experimental mass spectra against theoretical spectra generated from a protein sequence database to perform identification, but proteins whose sequences are not present in the database cannot be identified unless their sequences are in part conserved. The alternative approach, de novo sequencing, makes it possible to infer a peptide sequence directly from a mass spectrum, but interpreting the long lists of peptide sequences that result from large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologous proteins from de novo sequencing data coupled to sequence alignment, allowing biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith-Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching Pyrococcus furiosus mass spectra against the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. Finally, we employed PepExplorer to produce a comprehensive proteomic assessment of the Bothrops jararaca plasma, a known biological source of natural inhibitors of snake toxins. PepExplorer is integrated into the PatternLab for Proteomics environment, which makes available various tools for downstream data analysis, including resources for quantitative and differential proteomics.
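The alignment step mentioned in the abstract builds on Smith-Waterman local alignment. The following is a minimal sketch of the textbook algorithm only; the modifications PepExplorer applies are not described in this text and are not reproduced here. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not the tool's parameters.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Textbook Smith-Waterman with a linear gap penalty. The scoring
    values are assumptions chosen for illustration only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            # Clamping at zero is what makes the alignment local.
            score[i][j] = max(0, diag, up, left)
            best = max(best, score[i][j])
    return best


if __name__ == "__main__":
    # A de novo peptide embedded in a longer (hypothetical) protein sequence.
    print(smith_waterman("PEPTIDE", "XXPEPTIDEXX"))
    # One residue differs, as might happen in a homologous protein.
    print(smith_waterman("PEPTIDE", "PEPTLDE"))
```

Because the score is clamped at zero, a short de novo peptide can score well against a matching region of a much longer protein sequence even when the flanking residues do not match.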

Highlights

We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching Pyrococcus furiosus mass spectra against the corresponding NCBI RefSeq database
To tackle the aforementioned shortcomings, and in line with our strong interest in diversity-driven proteomics [29], we present a methodology for post-processing de novo sequencing data that allows inference of protein identification through statistical mapping of de novo sequencing results to a protein sequence database
We further manually examined the non-decoy proteins uniquely identified by PepExplorer; because we individually analyzed each case, based on spectral quality, alignment scores, and coverage, we feel comfortable in considering them as correctly identified, even though they were not found in our gold-standard search, ProLuCID

Summary

Introduction

In scenarios such as these, proteomics has the potential to allow a better understanding of the complexity of biological systems and the process of evolution than the study of the genetic code alone. In 2007, Elias and Gygi published a seminal paper on the target-decoy approach to shotgun proteomics [7] that established this approach as a standard and motivated the development of several statistical filters capable of converging to a list of confident identifications satisfying a user-specified false-discovery rate (FDR) with significantly more sensitivity than the conservative Washburn criterion. Increasing the search space in the PSM approach leads to decreased sensitivity [13].
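The idea of filtering identifications to a user-specified FDR with a target-decoy database can be sketched as follows. This is a simplified illustration, not the statistical filter used by PepExplorer or by any tool cited above: it assumes each hit carries a single score where higher is better, and it estimates the FDR as the ratio of decoy hits to target hits above the cutoff. The function name and the input layout are hypothetical.

```python
def filter_at_fdr(hits, max_fdr=0.01):
    """Return the target hits accepted at the given estimated FDR.

    hits: iterable of (score, is_decoy) pairs; higher score is better.
    The FDR at a cutoff is estimated as decoys / targets among the hits
    scoring at or above it (a common simple estimator, assumed here).
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # length of the longest ranked prefix that satisfies max_fdr
    for idx, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = idx
    # Report only target hits; decoys served to estimate the error rate.
    return [h for h in ranked[:cutoff] if not h[1]]


if __name__ == "__main__":
    # Made-up scores for illustration only.
    example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False)]
    print(filter_at_fdr(example, max_fdr=0.01))
    print(filter_at_fdr(example, max_fdr=0.5))
```

Keeping the longest prefix that satisfies the threshold, rather than stopping at the first violation, accepts as many identifications as the estimated error rate allows.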

Methods

Results

Conclusion