Abstract

Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.

Highlights

  • Sequencing the first complete genome of Haemophilus influenzae in 1995 opened a new page in genome sciences

  • In this paper we briefly review the problems associated with identification of coding sequences (CDS) in bacterial and archaeal genomes and demonstrate how comparative genomics can help in the location of missed genes

  • Errors can accumulate at different stages, from genome sequencing to the assignment of metabolic pathways


Introduction

Sequencing the first complete genome of Haemophilus influenzae in 1995 opened a new page in genome sciences. In this paper we briefly review the problems associated with identification of coding sequences (CDS) in bacterial and archaeal genomes and demonstrate how comparative genomics can help in the location of missed genes. In the absence of introns, it might seem that an ORF can be designated as any substring of DNA that begins with a start codon and ends with a stop codon. If we apply this rule to any bacterial or archaeal genome, however, we obtain many overlapping and short ORFs. A difficult task for gene prediction software is to decide which of two overlapping ORFs represents a true gene. Short erroneously identified ORFs would not greatly impact an orthologue collection, as they most likely would not have orthologues in other genomes. They will, however, mistakenly increase the set of ORFans, the origin of which is still debated (Siew & Fischer, 2003). The percentage of missed genes in one genome likely does not exceed 5–10%; nevertheless, even a single missed gene can lead to a wrong biological inference, especially if the missed gene encodes a key enzyme of a metabolic pathway.
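The naive start-to-stop rule described above can be illustrated with a minimal sketch. It is not a gene-calling method from this paper; the function name `naive_orfs`, the `min_codons` parameter, the choice of three start codons, and the reading of the rule as "in-frame start followed by the next in-frame stop" are all assumptions made for illustration. Only the forward strand is scanned.

```python
# Naive ORF scan: illustrative only, not a gene-calling method from the paper.
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: three common bacterial start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


def naive_orfs(seq, min_codons=1):
    """Return sorted (start, end, frame) tuples for start-to-stop substrings.

    Coordinates are 0-based; end is exclusive and includes the stop codon.
    min_codons counts codons from the start codon up to, but not including,
    the stop codon. Only the forward strand is scanned (hypothetical helper).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        open_starts = []  # start positions seen since the last stop in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in START_CODONS:
                open_starts.append(i)
            elif codon in STOP_CODONS:
                # Every start seen so far yields its own candidate ORF,
                # which is what produces nested, overlapping candidates.
                for s in open_starts:
                    if (i - s) // 3 >= min_codons:
                        orfs.append((s, i + 3, frame))
                open_starts = []
    return sorted(orfs)


if __name__ == "__main__":
    # Two in-frame starts share one stop, giving two overlapping candidates.
    print(naive_orfs("ATGATGAAATAA"))
    # A length threshold discards the shorter candidate.
    print(naive_orfs("ATGATGAAATAA", min_codons=3))
```

Even on this toy sequence the rule returns two nested candidates for a single stop codon, and a length threshold is one crude way to prune them. Real gene finders rely on additional evidence, such as sequence composition and similarity to known genes, to choose among overlapping candidates.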

Error estimation in annotated genomes
Comparative genome analysis as a refining tool
No description on CDS finding available
Prospects for a perfect annotation
Findings
Concluding remarks