Abstract
Background: A large number of animal and plant genomes have been completely sequenced over the last decade and are now publicly available. Although genomes can be rapidly sequenced, identifying protein-coding genes still remains a problematic task. Availability of protein sequence data allows direct confirmation of protein-coding genes. Mass spectrometry has recently emerged as a powerful tool for proteomic studies. Protein identification using mass spectrometry is usually carried out by searching against databases of known proteins or transcripts. This approach generally does not allow identification of proteins that have not yet been predicted or whose transcripts have not been identified.

Results: We searched 3,967 mass spectra from 16 LC-MS/MS runs of Anopheles gambiae salivary gland homogenates against the Anopheles gambiae genome database. This allowed us to validate 23 known transcripts and 50 novel transcripts. In addition, a novel gene was identified on the basis of peptides that matched a genomic region where no gene was known and no transcript had been predicted. The amino termini of proteins encoded by two predicted transcripts were confirmed based on N-terminally acetylated peptides sequenced by tandem mass spectrometry. Finally, six sequence polymorphisms could be annotated based on experimentally obtained peptide sequences.

Conclusion: The peptide sequences from this study were mapped onto the genomic sequence using the distributed annotation system available at Ensembl and can be visualized in the context of all other existing annotations. The strategy described in this paper can be used to correct and confirm genome annotations and permit discovery of novel proteins in a high-throughput manner by mass spectrometry.
Highlights
A large number of animal and plant genomes have been completely sequenced over the last decade and are publicly available
Predicted transcripts can be validated using direct peptide sequence data, such as that obtained by tandem mass spectrometry in our study
We will illustrate seven different situations in which mass spectrometry data assisted us in the genome annotation: a) peptide sequences that matched exons in known transcripts; b) peptide sequences that matched exons in novel transcripts; c) peptide sequences that matched regions of the genome where no genes were predicted at all; d) peptide sequences that matched regions annotated as untranslated regions (UTRs); e) peptide sequences that matched regions annotated as introns of known or novel genes; f) data on N-terminal acetylation sites for mapping the amino terminus of the mature protein; and g) sequence polymorphisms that could represent coding single nucleotide polymorphisms
Summary
A large number of animal and plant genomes have been completely sequenced over the last decade and are publicly available. The use of mass spectrometry to assist the validation of genome annotation has been previously demonstrated in prokaryotes [4], yeast [5], plants [6] and humans [7]. Two of these studies [5,7] did not directly search mass spectrometry-derived data against the genomic databases; rather, a post hoc integration of peptide sequences with the genomic sequence was carried out. This approach is poorly suited to annotating genomes because any region of a genome that has no transcript associated with it (e.g. introns and intergenic regions) will not be identified.
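
To illustrate why searching the genome directly can reach regions with no associated transcript, the sketch below translates a DNA sequence in all six reading frames and reports where a peptide matches. It is a simplified illustration and not the method used in the study: it assumes the peptide sequence is already known (real searches match spectra against the database), it uses the standard genetic code, and it cannot find peptides that span exon junctions. The function names `reverse_complement`, `translate` and `find_peptide` and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# organism-specific code variants are needed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case ACGT string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_peptide(genome, peptide):
    """Locate a peptide in all six reading frames of a genomic sequence.

    Returns (strand, frame, start, end) tuples, where start/end are 0-based,
    end-exclusive coordinates on the forward strand.
    """
    hits = []
    n = len(genome)
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets back to forward coordinates.
                    start, end = n - end, n - start
                hits.append((strand, frame, start, end))
                pos = protein.find(peptide, pos + 1)
    return hits


# Toy example: the peptide MAKW is encoded at positions 2..14 of this sequence.
toy_genome = "CCATGGCTAAATGGTT"
print(find_peptide(toy_genome, "MAKW"))
```

Because the search runs over the raw genomic sequence, a hit is reported regardless of whether any gene or transcript has been annotated at that position; comparing the returned coordinates with existing annotations is what would separate exon, UTR, intron and intergenic matches.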