NextSearch: A Search Engine for Mass Spectrometry Data against a Compact Nucleotide Exon Graph.

Hyunwoo Kim, Eunok Paek, Heejin Park

doi:10.1021/acs.jproteome.5b00047

Abstract

Proteogenomics research has relied on six-frame translation of the whole genome or on amino acid exon graphs to overcome the limitations of reference protein sequence databases; however, six-frame translation is not suitable for annotating genes that span multiple exons, and amino acid exon graphs cannot conveniently represent novel splice variants and exon skipping events between exons of incompatible reading frames. We propose NextSearch (Nucleotide EXon-graph Transcriptome Search), a proteogenomic pipeline based on a nucleotide exon graph. The pipeline consists of two parts: construction of a compact nucleotide exon graph that systematically incorporates novel splice variations, and a search tool that identifies peptides by searching tandem mass spectra directly against the nucleotide exon graph. Because the exon graph stores nucleotide sequences, it can easily represent novel splice variations and exon skipping events between exons of incompatible reading frames. Peptide identification is performed against the nucleotide exon graph itself, without converting it into protein sequences in FASTA format, which reduces the storage size of the sequence database by an order of magnitude. NextSearch outputs the proteome-genome/transcriptome mapping results in a general feature format (GFF) file, which can be visualized with public tools such as the UCSC Genome Browser.
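To illustrate why storing nucleotides rather than amino acids helps with exons of incompatible reading frames, the following is a minimal sketch in Python. It is not NextSearch's implementation: the toy exon sequences, the `exons` and `edges` dictionaries, and the functions `translate`, `paths`, and `peptides` are all hypothetical names chosen for illustration. The sketch also enumerates every path explicitly for clarity, which is an assumption made for readability and not a description of how the actual search traverses the graph.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(nt, frame=0):
    """Translate a nucleotide string starting at offset `frame` (0, 1, or 2)."""
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(frame, len(nt) - 2, 3))

# Hypothetical toy graph: each node holds an exon's nucleotide sequence.
exons = {"e1": "ATGGCC", "e2": "AAAG", "e3": "GTTTGA"}
# Edges are splice junctions; e1 -> e3 models skipping of exon e2.
edges = {"e1": ["e2", "e3"], "e2": ["e3"], "e3": []}

def paths(node):
    """Yield every exon path from `node` to a node with no outgoing edges."""
    if not edges[node]:
        yield [node]
    for nxt in edges[node]:
        for rest in paths(nxt):
            yield [node] + rest

def peptides(start):
    """Map each path from `start` to the translation of its spliced sequence."""
    return {"-".join(p): translate("".join(exons[e] for e in p)) for p in paths(start)}

if __name__ == "__main__":
    for path, peptide in peptides("e1").items():
        print(path, peptide)
```

In this toy graph, exon `e2` has a length that is not a multiple of three, so exon `e3` is read in a different frame depending on whether `e2` is included or skipped. A graph whose nodes stored amino acids would need a separate translated copy of `e3` for each frame, whereas the nucleotide graph keeps a single node and resolves the frame only when a path is translated.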

Full Text