Construction and assessment of individualized proteogenomic databases for large-scale analysis of nonsynonymous single nucleotide variants.

Karsten Krug, Alejandro Carpy, Christoph Taumer, Boris Macek, Sasa Popic

doi:10.1002/pmic.201400219

Abstract

Next-generation sequencing projects focusing on genomes and transcriptomes identify millions of single nucleotide variants (SNVs), many of which result in single amino acid substitutions. These nonsynonymous (ns) SNVs are typically not incorporated into the protein sequence databases used to identify peptides from MS/MS data. Here, we perform a comparative analysis of strategies for assembling nsSNV-containing proteogenomic databases. We use a comprehensive transcriptome and proteome dataset of HeLa cells from the literature to derive SNVs, to incorporate them into databases applicable to proteomics search engines, and to assess the performance of these databases in the identification of nsSNVs. We assemble the databases by (1) translation of SNV-containing transcripts in all possible reading frames, (2) translation of the predicted reading frame, and (3) prediction of nsSNVs and their subsequent incorporation into canonical protein sequences. We show substantial differences between the generated databases in terms of represented nsSNVs and theoretical search space, which affect the sensitivity and specificity of the database search. We query the databases with >2.2M high-resolution MS/MS spectra using MaxQuant software and identify 451 variant peptides containing 401 nsSNVs. We conclude that predicting the reading frame and, where applicable, the SNV effect yields comprehensive yet compact databases, which are necessary to retain sensitivity in large-scale analysis of nsSNVs called from transcriptomics data.
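The difference between assembly strategies (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy transcript, and the 0-based and 1-based position conventions are all assumptions made for illustration, and only the standard genetic code is taken as given. The sketch translates a variant transcript in all six reading frames, and separately writes a predicted substitution directly into a canonical protein sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frame(dna, offset=0):
    """Translate an uppercase A/C/G/T string from `offset`; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame(dna):
    """Strategy (1), sketched: three forward and three reverse-strand translations."""
    return [translate_frame(strand, offset)
            for strand in (dna, reverse_complement(dna))
            for offset in range(3)]


def apply_snv(dna, index, alt):
    """Hypothetical helper: substitute one base at a 0-based transcript index."""
    return dna[:index] + alt + dna[index + 1:]


def substitute_residue(protein, position, ref, alt):
    """Strategy (3), sketched: write a predicted substitution (1-based position)
    into a canonical protein sequence after checking the reference residue."""
    if protein[position - 1] != ref:
        raise ValueError("reference residue does not match the canonical sequence")
    return protein[:position - 1] + alt + protein[position:]


# Toy transcript (an assumption, not data from the study): ATG GCC AAG TAA.
transcript = "ATGGCCAAGTAA"
variant_transcript = apply_snv(transcript, 4, "T")   # GCC -> GTC, i.e. A2V
six_frame_entries = six_frame(variant_transcript)    # six candidate sequences
canonical_variant = substitute_residue("MAK", 2, "A", "V")
```

In this toy case the six-frame route produces six candidate sequences per transcript, only one of which corresponds to the true protein, whereas the substitution route adds a single variant sequence. This mirrors, in miniature, the trade-off between theoretical search space and represented nsSNVs described in the abstract.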

Full Text