Missing genes in the annotation of prokaryotic genomes

Andrew S Warren, Jeremy Archuleta, Wu-Chun Feng, João Carlos Setubal

doi:10.1186/1471-2105-11-131

Abstract

Background: Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. However, there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). The question therefore arises as to whether current genome annotations are systematically missing small genes.

Results: We have developed a high-performance computing methodology to investigate this problem. In this methodology we compare all ORFs larger than or equal to 33 aa from all fully-sequenced prokaryotic replicons. Based on that comparison, and using conservative criteria requiring a minimum taxonomic diversity between conserved ORFs in different genomes, we have discovered 1,153 candidate genes that are missing from current genome annotations. These missing genes are similar only to each other and do not have any strong similarity to gene sequences in public databases, with the implication that these ORFs belong to missing gene families. We also uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes (we call these absent annotations). The vast majority of the missing genes found are small (less than 100 aa). A comparison of select examples with GeneMark, EasyGene and Glimmer predictions yields evidence that some of these genes are escaping detection by these programs.

Conclusions: Prokaryotic gene finders and prokaryotic genome annotations require improvement for accurate prediction of small genes. The number of missing gene families found is likely a lower bound on the actual number, due to the conservative criteria used to determine whether an ORF corresponds to a real gene.
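As an illustration of the first step described in the abstract, the following is a minimal sketch of enumerating ORFs of at least 33 aa on both strands of a replicon. The 33 aa threshold comes from the abstract; everything else is an assumption made for illustration and is not taken from the paper: an ORF is defined here as an ATG start codon followed by an in-frame stop codon (TAA, TAG or TGA), the length counts the start codon but not the stop, and the function name `find_orfs` is hypothetical.

```python
# Sketch only: the ORF definition (ATG to in-frame stop) is an assumption,
# not necessarily the definition used by the authors.
STOPS = {"TAA", "TAG", "TGA"}
MIN_AA = 33  # minimum ORF length stated in the abstract

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_aa=MIN_AA):
    """Return (strand, start, end, aa_length) tuples for all six frames.

    Coordinates are 0-based, half-open, and always given on the forward
    strand; the stop codon is included in the interval.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG since the last stop
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    aa_len = (i - start) // 3  # codons before the stop
                    if aa_len >= min_aa:
                        end = i + 3
                        if strand == "+":
                            orfs.append((strand, start, end, aa_len))
                        else:
                            # map back to forward-strand coordinates
                            orfs.append((strand, n - end, n - start, aa_len))
                    start = None
    return orfs
```

Applied to every fully sequenced replicon, a routine of this kind would produce the ORF set that is then compared across genomes; the comparison itself is not sketched here.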

Highlights

Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes
By comparing the coordinates of each open reading frame (ORF) in the genomic sequence with its current annotation, we separate ORFs into three groups: (1) those that coincide with currently annotated genes; (2) those that overlap with an annotated gene or other annotated entity, e.g. RNA genes, pseudogenes, etc.; and (3) those that do not share genomic space with any annotated entity
There were 1,121,362 intergenic sequences that went “unclassified,” and such ORFs are a clear target for finding additional genes
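The three-way separation of ORFs described in the highlights can be sketched as a coordinate comparison. This is a hedged illustration, not the authors' implementation: the `Interval` type, the function names, and the rule that an ORF "coincides" with an annotated gene when it lies on the same strand and shares the same 3' (stop) end are all assumptions introduced here. Overlap is tested regardless of strand, on the reading that sharing genomic space does not depend on orientation.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    start: int         # 0-based, inclusive
    end: int           # exclusive
    strand: str        # "+" or "-"
    kind: str = "ORF"  # annotation type, e.g. "CDS", "tRNA", "pseudogene"


def three_prime_end(iv: Interval) -> int:
    """Coordinate of the 3' end, which depends on the strand."""
    return iv.end if iv.strand == "+" else iv.start


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def classify_orf(orf: Interval, annotations: List[Interval]) -> str:
    """Assign an ORF to one of the three groups.

    "annotated"   : coincides with an annotated protein-coding gene
                    (assumed rule: same strand and same 3' end)
    "overlapping" : shares genomic space with any annotated entity
    "intergenic"  : shares no genomic space with any annotated entity
    """
    hits = [f for f in annotations if overlaps(orf, f)]
    for f in hits:
        if (f.kind == "CDS" and f.strand == orf.strand
                and three_prime_end(f) == three_prime_end(orf)):
            return "annotated"
    return "overlapping" if hits else "intergenic"


annotations = [
    Interval(100, 400, "+", "CDS"),
    Interval(1000, 1100, "-", "tRNA"),
]
groups = [
    classify_orf(Interval(130, 400, "+"), annotations),   # same stop as the CDS
    classify_orf(Interval(350, 500, "-"), annotations),   # overlaps the CDS
    classify_orf(Interval(600, 750, "+"), annotations),   # touches nothing
]
```

The linear scan over all annotations keeps the sketch short; at the scale reported in the highlights, a sorted list or interval tree per replicon would be the more practical choice.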

Summary

Introduction

Protein-coding gene detection in prokaryotic genomes is considered a much simpler problem than in intron-containing eukaryotic genomes. The most widely used gene finding programs build a gene model based on the characteristics of sequences that are likely to be real genes [4,5,6]. This model is then used to evaluate the likelihood that an individual segment codes for a gene. With this method it is possible to miss genes with anomalous sequence composition: genes that do not fit the genomic pattern and do not have similar sequences in current annotation databases may go undetected. If this problem occurs frequently in genome annotation projects, many such genes may be missing from current prokaryotic annotation databases.

Journal: BMC Bioinformatics	Publication Date: Mar 15, 2010
Citations: 141	License type: CC BY 2.0
