Abstract

Human proteomic databases required for MS peptide identification are frequently updated and carefully curated, yet are still incomplete because it has been challenging to acquire every protein sequence from the diverse assemblage of proteoforms expressed in every tissue and cell type. In particular, alternative splicing has been shown to be a major source of this cell-specific proteomic variation. Many new alternative splice forms have been detected at the transcript level using next-generation sequencing methods, especially RNA-Seq, but it is not known how many of these transcripts are being translated. Leveraging the capabilities of next-generation sequencing methods, we collected RNA-Seq and proteomics data from the same cell population (Jurkat cells) and created a bioinformatics pipeline that builds customized databases for the discovery of novel splice-junction peptides. Eighty million paired-end Illumina reads and ∼500,000 tandem mass spectra were used to identify 12,873 transcripts (19,320 including isoforms) and 6810 proteins. We developed a bioinformatics workflow to retrieve high-confidence, novel splice junction sequences from the RNA data, translate these sequences into the corresponding polypeptide sequences, and create a customized splice junction database for MS searching. Based on the RefSeq gene models, we detected 136,123 annotated and 144,818 unannotated transcript junctions. Of the unannotated junctions, 24,834 passed various quality filters (e.g., minimum read depth); these entries were translated into 33,589 polypeptide sequences and used for database searching. We discovered 57 splice junction peptides not present in the UniProt-TrEMBL proteomic database, comprising an array of different splicing events, including skipped exons, alternative donors and acceptors, and noncanonical transcriptional start sites. To our knowledge, this is the first example of using sample-specific RNA-Seq data to create a splice-junction database and discover new peptides resulting from alternative splicing.
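The filter-then-translate step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `min_depth` default, and the rule of keeping one stop-free segment per reading frame are assumptions made here for illustration, and a real workflow would also handle strand, flanking length, and frame selection.

```python
# Standard genetic code, built from the canonical TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame):
    """Translate a nucleotide string in the given frame (0, 1, or 2)."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    # Codons containing ambiguous bases are reported as 'X'.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def junction_peptides(left_exon, right_exon, read_depth, min_depth=2):
    """Translate a junction in three frames (hypothetical helper).

    For each frame, returns the stop-free segment that contains the
    codon holding the first base of the right exon. Junctions below
    min_depth (an assumed threshold) yield no sequences.
    """
    if read_depth < min_depth:
        return []
    seq = left_exon + right_exon
    junction = len(left_exon)
    peptides = []
    for frame in range(3):
        protein = translate(seq, frame)
        idx = (junction - frame) // 3  # residue index of the junction codon
        if idx < 0 or idx >= len(protein) or protein[idx] == "*":
            continue
        start = protein.rfind("*", 0, idx) + 1
        end = protein.find("*", idx)
        end = len(protein) if end == -1 else end
        peptides.append(protein[start:end])
    return peptides
```

For example, `junction_peptides("ATGGCC", "AAATAG", read_depth=5)` returns `["MAK", "WPN", "GQI"]`, while the same junction with `read_depth=1` is filtered out and returns an empty list.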

Highlights

  • Mass spectrometry-based proteomics relies on accurate databases to identify and quantify proteins, including those derived from splice variants, indels, and single nucleotide variants (SNVs) [1].

  • The abbreviations used are: SNV, single nucleotide variant; cDNA, complementary DNA; FASP, filter aided sample preparation; GENCODE, component of the ENCODE project that aims to build accurate human reference annotations; GTF, gene annotation file; ppm, parts per million; PSI, percentage spliced in; RNA-Seq, RNA sequencing; RSEM, RNA-Seq by Expectation Maximization; SDT, SDS- and DTT-based buffer used in the FASP protocol (contains SDS and dithiothreitol); TPM, transcripts per million; XCorr, SEQUEST cross-correlation score.

  • The most common splicing events were small insertions and deletions occurring at the 3′ acceptor exons, frequently characterized by NAGNAG motifs, in which two AG dinucleotide splice site acceptors sit in close proximity to each other. This agrees with recent gene validation efforts of the GENCODE gene annotation project, in which mass spectrometry data retrieved from the Global Proteome Machine (GPM) and PeptideAtlas were aligned to GENCODE gene models to assess the number of translated products [17].
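As a small illustration of the NAGNAG motif mentioned above, the following Python sketch scans a nucleotide window for tandem acceptor sites. The function name and the use of a regular expression are assumptions for illustration; the source does not describe how the motifs were detected.

```python
import re

# Lookahead so that overlapping occurrences (e.g. CAGCAGCAG) are all reported.
NAGNAG = re.compile(r"(?=([ACGT]AG[ACGT]AG))")


def find_nagnag(seq):
    """Return (0-based start, motif) for every NAGNAG occurrence in seq."""
    return [(m.start(), m.group(1)) for m in NAGNAG.finditer(seq.upper())]
```

Calling `find_nagnag("TTTCAGCAGGT")` returns `[(3, "CAGCAG")]`. Because the two candidate AG acceptors are three nucleotides apart, choosing one over the other adds or removes a single codon without shifting the reading frame, which is consistent with the small insertions and deletions described in the highlight.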


Summary

Technological Innovation and Resources

Discovery and Mass Spectrometric Analysis of Novel Splice-junction Peptides Using RNA-Seq. Though the focus of this paper is on the study of alternative splice junctions, other bioinformatics strategies to extract information from RNA-Seq data have been employed to create customized mass spectrometry databases. These include reducing a database to only include sequences with transcript expression evidence [40], including fusion or chimeric sequences [44], incorporating nonsynonymous single nucleotide polymorphism (SNP) or SNV sequences [40], and, for non-model systems, building a proteomic database from de novo assembled transcripts [45, 46]. We discovered 57 splice junction peptides not present in the UniProt-TrEMBL proteomic database using stringent MS search parameters and post-processing steps, including a conservative 1% local false discovery rate and manual validation of junction peptide MS2 spectra. To our knowledge, this is the first example of using sample-specific RNA-Seq data to discover new peptides resulting from alternative splicing.
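One of the related strategies listed above, reducing a database to sequences with transcript expression evidence, can be sketched in a few lines of Python. The function name, the dictionary inputs, and the `min_tpm` cutoff are hypothetical; the cited work [40] may use different criteria.

```python
def filter_by_expression(proteins, tpm, min_tpm=1.0):
    """Keep protein entries whose transcript has TPM >= min_tpm.

    proteins: dict mapping transcript ID -> protein sequence
    tpm:      dict mapping transcript ID -> transcripts per million
    min_tpm:  assumed cutoff, not taken from the source
    """
    return {
        tid: seq
        for tid, seq in proteins.items()
        if tpm.get(tid, 0.0) >= min_tpm  # transcripts with no estimate are dropped
    }
```

A smaller, sample-matched search space of this kind is the general motivation for such databases; the trade-off is that proteins whose transcripts fall below the cutoff can no longer be identified.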

EXPERIMENTAL PROCEDURES
Jurkat Cells
RESULTS
Relative Abundance
Within intron
DISCUSSION