Re-annotation of genome microbial CoDing-Sequences: finding new genes and inaccurately annotated genes

Stéphanie Bocs, Claudine Médigue, Antoine Danchin

doi:10.1186/1471-2105-3-5

Open Access

https://doi.org/10.1186/1471-2105-3-5

Abstract

Background: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is embedded in the approach.

Results: We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine frameshifts. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank.

Conclusions: The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular updating and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence the unfortunate spreading of errors through centralized data libraries).

Highlights

Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes.
From the set of entirely missed annotated genes (i.e. Gene Not Found, GNF = Original Annotation (OA) - CC) and the set of newly predicted genes, the percentage of genes in each category is given with reference to the value of their average coding probability (Pc).
We found that a sizeable number of genes annotated within the framework of large-scale sequencing projects are likely to be partially inaccurate or plainly wrong (2%).
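
The GNF definition in the second highlight is a set difference, and the per-category percentages amount to binning genes by Pc. The following is a minimal sketch of that bookkeeping, not the authors' program. It assumes that CC denotes the annotated genes that the program also recovered. The gene names, the function names and the Pc cut-offs in `edges` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import Counter


def gene_not_found(original_annotation, common):
    """GNF = OA - CC: annotated genes that are absent from the common set."""
    return set(original_annotation) - set(common)


def percent_per_pc_bin(pc_by_gene, edges=(0.25, 0.5, 0.75)):
    """Percentage of genes falling in each Pc bin.

    `edges` are illustrative cut-offs, not values taken from the paper.
    Bin 0 holds Pc < edges[0]; the last bin holds Pc >= edges[-1].
    """
    if not pc_by_gene:
        return {}
    counts = Counter(bisect_right(edges, pc) for pc in pc_by_gene.values())
    total = len(pc_by_gene)
    return {index: 100.0 * n / total for index, n in sorted(counts.items())}


# Hypothetical toy data.
oa = {"geneA", "geneB", "geneC", "geneD"}   # original annotation
cc = {"geneA", "geneC"}                     # assumed: genes found by both
gnf = gene_not_found(oa, cc)                # {"geneB", "geneD"}
pc = {"geneB": 0.1, "geneD": 0.8}           # made-up average coding probabilities
percentages = percent_per_pc_bin({g: pc[g] for g in gnf})
```

The same binning function can be applied unchanged to the set of newly predicted genes, which is how the two sets would be compared category by category.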

Summary

Introduction

Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A typical example of statistical gene-finding methods is the GeneMark software [2], a deservedly popular gene prediction program for prokaryotes, which uses periodic Markov models to find DNA regions that code for proteins. An alternative, similarity-based approach requires translating the query DNA in all six frames so that the resulting amino acid sequences can be compared with known proteins (BLASTX program). Although this method has been shown to be relatively effective for gene finding [4], it is too time-consuming to be used as a common procedure. It has recently been shown that a great many spurious short genes are generally annotated in genomes [5], and that the number of potential errors in predicted functional annotation is higher than is usually believed, mainly because such annotation is based on relatively weak sequence identities and/or partial alignments [6].
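
To make the six-frame translation step concrete, here is a minimal sketch in plain Python. It is not the BLASTX implementation: it only shows how one DNA query yields six candidate protein sequences (three reading frames on each strand), each of which then has to be compared with known proteins. The function names are hypothetical, and the standard codon table is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3, -1..-3) to their translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
# frames["+1"] is "MA*" and frames["-1"] is "LGH"
```

Because every query produces six translations that must each be searched against a protein database, the cost grows accordingly, which is consistent with the remark above that the approach is too time-consuming for routine use.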

Journal: BMC Bioinformatics
Publication Date: Jan 1, 2002
Citations: 78
License type: cc-by

Similar Papers

Editorial: Z-curve Applications in Genome Analysis.
Chun-Ting Zhang
Current genomics | VOL. 15
01 Apr 2014

Non-essential ribosomal proteins in bacteria and archaea identified using COGs.
Michael Y Galperin ... Sofya K Garushyants
Journal of Bacteriology | VOL. 203
07 May 2021

MICheck: a web tool for fast checking of syntactic annotations of bacterial genomes
S Cruveiller ... D Vallenet
Nucleic Acids Research | VOL. 33
27 Jun 2005

A hybrid strategy for comprehensive annotation of the protein coding genes in prokaryotic genome
Jia-Feng Yu ... Yue Hou
Genes & Genomics | VOL. 37
08 Jan 2015
