Abstract

An increasing number of studies integrate mRNA sequencing data into MS-based proteomics to complement the translation product search space. However, several factors, including extensive regulation of mRNA translation and the need for three- or six-frame translation, impede the use of mRNA-seq data for the construction of a protein sequence search database. With that in mind, we developed the PROTEOFORMER tool, which automatically processes data from the recently developed ribosome profiling method (sequencing of ribosome-protected mRNA fragments) and produces a genome-wide visualization of ribosome occupancy. Our tool also includes a translation initiation site calling algorithm that allows the delineation of the open reading frames (ORFs) of all translation products. A complete protein synthesis-based sequence database can thus be compiled for mass spectrometry-based identification. This approach increases the overall protein identification rates by 3% and 11% (improved and new identifications) for human and mouse, respectively, and enables proteome-wide detection of 5′-extended proteoforms, upstream ORF translation and near-cognate translation start sites. The PROTEOFORMER tool is available as a stand-alone pipeline and has been implemented in the Galaxy framework for ease of use.

Highlights

  • The integration of next-generation transcriptome sequencing and highly sensitive mass spectrometry (MS) has emerged as a powerful strategy for the fast and comprehensive profiling of mammalian proteomes [1]

  • The PROTEOFORMER pipeline (Figure 1) is made up of six major steps: (i) alignment of the ribosome-protected fragment (RPF) reads to a reference genome, (ii) quality control of the alignments, (iii) assignment of transcripts with evidence of translation, (iv) identification of translation initiation sites (TIS), (v) inclusion of single nucleotide polymorphism (SNP) information and (vi) generation of a RIBO-seq derived translation product database that can be used as a search space for MS-based proteomics studies, either independently or combined with a canonical protein database

  • In order to test the performance of the PROTEOFORMER method, we optimized different modules toward the creation of a protein-synthesis based sequence database, using available mouse embryonic stem cell RIBO-seq data [8]
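Step (iv) and step (vi) above can be illustrated with a minimal Python sketch: given a transcript sequence and the position of a called TIS, read codons in frame until the first stop codon and emit the resulting protein sequence. This is not the PROTEOFORMER implementation. The function name `delineate_orf`, its parameters, and the choice to write methionine as the first residue of ORFs that start at a near-cognate codon are assumptions made for illustration only.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def delineate_orf(transcript, tis_index, force_initiator_met=True):
    """Translate from a called TIS to the first in-frame stop codon.

    transcript: DNA-alphabet transcript sequence (hypothetical input).
    tis_index: 0-based position of the first base of the start codon.
    force_initiator_met: assumption for this sketch; if True, the first
        residue is written as 'M' even for near-cognate starts such as CTG.
    Returns the protein sequence, or None if no in-frame stop is found.
    """
    seq = transcript.upper()
    protein = []
    for i in range(tis_index, len(seq) - 2, 3):
        residue = CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for ambiguous codons
        if residue == "*":
            if protein and force_initiator_met:
                protein[0] = "M"
            return "".join(protein)
        protein.append(residue)
    return None  # ran off the transcript without reaching a stop codon


if __name__ == "__main__":
    # Toy transcript with a near-cognate CTG start at index 3.
    print(delineate_orf("CCCCTGGCAAAATAGGG", 3))
```

Each sequence produced this way would become one entry in the RIBO-seq derived search database. A real pipeline additionally has to handle splicing, SNP variants and redundant entries, none of which this sketch attempts.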


Summary

Introduction

The integration of next-generation transcriptome sequencing and highly sensitive mass spectrometry (MS) has emerged as a powerful strategy for the fast and comprehensive profiling of mammalian proteomes [1]. Protein sequence database search tools [2] typically use publicly available protein databases, such as Swiss-Prot and Ensembl, to match MS spectra to peptides. Because these reference databases contain only experimentally verified and/or predicted protein sequences, they are very unlikely to give a comprehensive picture of the protein pool expressed in a given sample. Translation product prediction based on messenger RNA sequencing (mRNA-seq) data gives a more representative picture of the expressed protein repertoire and aids the protein identification process by eliminating unexpressed gene products from the search space [3]. However, inclusion of mRNA-seq information requires three- or six-frame translation of the derived sequences, which dramatically expands the protein search space and decreases the search sensitivity and specificity [9].
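To make the search-space expansion concrete, the following is a minimal sketch of six-frame translation in Python. It is an illustration under stated assumptions, not code from any of the cited tools: the function names are hypothetical, the standard genetic code is assumed, and stop codons are kept as `*` rather than being used to split the frames into candidate peptides.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate a sequence in all three frames on both strands.

    Returns a dict keyed '+1', '+2', '+3', '-1', '-2', '-3'.
    """
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translation("ATGGCCTAA").items():
        print(frame, protein)
```

Every input sequence yields six candidate translations, of which at most one reflects the protein actually synthesized. The remaining frames only enlarge the database that MS spectra are matched against, which is the cost in sensitivity and specificity described above.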
