Abstract

To facilitate genome-based representation and analysis of proteomics data, we developed a new bioinformatics framework, proBAMsuite, in which a central component is the protein BAM (proBAM) file format for organizing peptide spectrum matches (PSMs) within the context of the genome. proBAMsuite also includes two R packages, proBAMr and proBAMtools, for generating and analyzing proBAM files, respectively. Applying proBAMsuite to three recently published proteomics datasets, we demonstrated its utility in facilitating efficient genome-based sharing, interpretation, and integration of proteomics data. First, the interpretation of proteomics data is significantly enhanced with the rich genomic annotation information. Second, PSMs can be easily reannotated using user-specified gene annotation schemes and assembled into both protein and gene identifications. Third, using the genome as a common reference, proBAMsuite facilitates seamless proteomics and proteogenomics data integration. Finally, proBAM files can be readily visualized in genome browsers and thus bring proteomics data analysis to a general audience beyond the proteomics community. Results from this study establish proBAMsuite as a useful bioinformatics framework for proteomics and proteogenomics research.

Highlights

  • Mass-spectrometry-based shotgun proteomics technology has undergone rapid advancements during the past decade

  • Although peptide and protein identification relies primarily on protein databases derived from the reference genome sequence, commonly used mass spectrometry data analysis software does not report the genomic locations of identified peptides, which limits genome-based interpretation and analysis of proteomics data and hinders effective proteogenomic data integration

  • We demonstrate the utility of proBAMsuite using three proteomics datasets: 1) CPTAC_CRC: proteomics data for 91 samples representing 86 The Cancer Genome Atlas colorectal cancer (CRC) tumors generated by the Vanderbilt Proteome Characterization Center in the National Cancer Institute Clinical Proteomic Tumor Analysis Consortium (CPTAC) [8]; 2) TUM_NCI_60: proteomics data for 61 samples representing 59 NCI-60 cell lines generated at the Technical University of Munich [4]; and 3) VU_CRC_10: proteomics data for 10 CRC cell lines generated at the Vanderbilt University School of Medicine [24]

Introduction

Mass-spectrometry-based shotgun proteomics technology has undergone rapid advancements during the past decade. Although peptide and protein identification relies primarily on protein databases derived from the reference genome sequence, commonly used mass spectrometry data analysis software does not report the genomic locations of identified peptides. This limits genome-based interpretation and analysis of proteomics data and hinders effective proteogenomic data integration. For example, peptides that map to multiple proteins derived from the same genomic locus can benefit from gene-level instead of protein-level inference, but it is unclear how many and which peptides fall into this category. As another example, exon–exon junction peptides are important for understanding alternative splicing and protein isoform complexity, but it is difficult to determine how many and which peptides span more than one exon with existing data formats. As proteogenomics is rapidly becoming an attractive and important research field (10–13), it is critical to have a new data format and supporting tools that enable smooth integration across proteomics, genomics, and transcriptomics data.
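To make the junction-peptide question concrete, the following minimal Python sketch shows how a peptide's position within a protein could be projected onto genomic coordinates, and how a peptide that spans more than one exon could then be recognized. It is an illustration only and is not the proBAMr implementation: the function names `peptide_to_genome` and `spans_junction`, the coordinate conventions, and the example exon coordinates are all hypothetical, and the sketch assumes a forward-strand transcript whose coding exon segments are already known.

```python
from typing import List, Tuple

# Hypothetical convention: 1-based, inclusive genomic start/end of the
# coding portion of an exon, listed in transcript order (forward strand only).
Exon = Tuple[int, int]


def peptide_to_genome(exons: List[Exon], pep_start: int, pep_end: int) -> List[Exon]:
    """Map a peptide to genomic blocks.

    pep_start and pep_end are 1-based, inclusive amino-acid positions
    of the peptide within the protein sequence.
    """
    # Convert amino-acid positions to 0-based, half-open CDS nucleotide offsets.
    cds_start = (pep_start - 1) * 3
    cds_end = pep_end * 3

    blocks: List[Exon] = []
    offset = 0  # CDS offset of the first coding base of the current exon
    for g_start, g_end in exons:
        length = g_end - g_start + 1
        # Overlap between the peptide and this exon, in CDS coordinates.
        lo = max(cds_start, offset)
        hi = min(cds_end, offset + length)
        if lo < hi:
            blocks.append((g_start + lo - offset, g_start + hi - offset - 1))
        offset += length
    return blocks


def spans_junction(blocks: List[Exon]) -> bool:
    """A peptide spans an exon-exon junction if it maps to more than one block."""
    return len(blocks) > 1


# Illustrative coordinates (not from any real gene): two coding exons
# encoding 10 and 20 amino acids, respectively.
example_exons = [(101, 130), (201, 260)]
junction_blocks = peptide_to_genome(example_exons, 8, 13)
single_exon_blocks = peptide_to_genome(example_exons, 1, 5)
```

Once every PSM carries genomic blocks of this kind, counting junction-spanning peptides, or grouping peptides by genomic locus for gene-level inference, reduces to a simple pass over the records. Reverse-strand transcripts would need the coordinates mirrored, which this sketch leaves out.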
