Abstract

Background: Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A major reason for this situation is that genes are tightly packed in prokaryotes, resulting in frequent overlap. Thus, detection of translation initiation sites and/or selection of the correct coding regions remain difficult unless appropriate biological knowledge (about the structure of a gene) is embedded in the approach.

Results: We have developed a new program that automatically identifies biologically significant candidate genes in a bacterial genome. Twenty-six complete prokaryotic genomes were analyzed using this tool, and the accuracy of gene finding was assessed by comparison with existing annotations. This analysis revealed that, despite the enormous effort of genome program annotators, a small but not negligible number of genes annotated within the framework of sequencing projects are likely to be partially inaccurate or plainly wrong. Moreover, the analysis of several putative new genes shows that, as expected, many short genes have escaped annotation. In most cases, these new genes revealed frameshifts that could be either artifacts or genuine. Some entirely unexpected new genes have also been identified. This allowed us to get a more complete picture of prokaryotic genomes. The results of this procedure are progressively integrated into the SWISS-PROT reference databank.

Conclusions: The results described in the present study show that our procedure is very satisfactory in terms of gene finding accuracy. Except in a few cases, discrepancies between our results and annotations provided by individual authors can be accounted for by the nature of each annotation process or by specific characteristics of some genomes. This stresses that close cooperation between scientists and regular updating and curation of the findings in databases are clearly required to reduce the level of errors in genome annotation (and hence to reduce the unfortunate spreading of errors through centralized data libraries).

Highlights

  • Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes

  • For the set of entirely missed annotated genes (i.e. Gene Not Found, GNF = Original Annotation (OA) - CC) and the set of newly predicted genes, the percentage of genes in each category is given with reference to their average coding probability (Pc)

  • We found that a sizeable number of genes annotated within the framework of large-scale sequencing projects are likely to be partially inaccurate or plainly wrong (2%)


Summary

Introduction

Analysis of any newly sequenced bacterial genome starts with the identification of protein-coding genes. Despite the accumulation of multiple complete genome sequences, which provide useful comparisons with close relatives among other organisms during the annotation process, accurate gene prediction remains quite difficult. A typical example of a gene prediction method is the GeneMark software [2], a deservedly popular program for prokaryotes, which uses periodic Markov models to find DNA regions that code for proteins. An alternative is similarity searching, in which the query DNA is translated in all six reading frames and the resulting amino acid sequences are compared with known proteins (BLASTX program). Although this method has been shown to be relatively effective for gene finding [4], it is too time-consuming to be used as a common procedure. It has recently been shown that a great many spurious short genes are generally annotated in genomes [5], and that the number of potential errors in functional annotation is higher than is usually believed, mainly because such annotation is based on relatively weak sequence identities and/or partial alignments [6].
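To make the six-frame translation step concrete, the following is a minimal Python sketch. It is an illustration only, not the code used by BLASTX or by the program described in this study. The function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical, the standard genetic code is assumed, and any codon containing a non-ACGT character is rendered as `X`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the standard table applies; some organisms use variant tables).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # 'X' for ambiguous codons
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Map frame labels (+1..+3, -1..-3) to their translated sequences."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Each of the six translated sequences would then have to be compared against a protein database, which is one reason the similarity-based approach costs considerably more than scanning the DNA once with a statistical model.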

