Abstract

Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the Patescibacteria lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.

Highlights

  • Genome annotation relies primarily on the detection of homology between newly identified genes/proteins and previously annotated sequences

  • Even more extreme than this is the feline parasite Mycoplasma haemofelis, which has functional annotations for only 19 % of its proteome [12, 25]. With such a wide range of annotation coverage found among bacteria, we aimed to investigate the extent of annotation coverage across the bacterial tree of life, as well as to identify factors related to this important property of genomes

  • In order to explore patterns of genome annotation across bacteria, we analysed 27 372 bacterial genomes included as part of the AnnoTree database [1]


Summary

Introduction

Genome annotation relies primarily on the detection of homology between newly identified genes/proteins and previously annotated sequences. As a general summary of this process, genes predicted in newly sequenced genomes or metagenomes are translated and compared against reference databases to identify homologues, with functional annotations being transferred from those homologues to the query proteins [3]. Although complicated by varying definitions of ‘function’ and ‘annotation’, homology-based annotation transfer has been systematically explored, revealing reasonable success rates (upwards of 60–70 % accuracy) based on assessment of Gene Ontology (GO) term prediction [4, 5]. Studies of early model organisms, such as Escherichia coli, Bacillus subtilis and Caulobacter crescentus, are a major source of these annotations. It is important to note that such limited sources can be expected to result in biases in genome annotation, with a greater success rate in species that are phylogenetically closer to these and other commonly studied species [6].
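
The annotation-transfer process summarized above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the input layout (precomputed homology hits as pairs of reference ID and E-value) and the 1e-5 E-value cutoff are all assumptions chosen for illustration. Each query protein inherits the function of its best-scoring annotated homologue, and annotation coverage is then the fraction of the proteome that received a function.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: each hit is (reference protein ID, E-value).
Hit = Tuple[str, float]


def transfer_annotations(
    hits: Dict[str, List[Hit]],
    reference_functions: Dict[str, str],
    max_evalue: float = 1e-5,  # assumed cutoff, not taken from the paper
) -> Dict[str, Optional[str]]:
    """Give each query the function of its best annotated homologue, if any."""
    annotations: Dict[str, Optional[str]] = {}
    for query_id, query_hits in hits.items():
        # Keep only significant hits whose reference protein has a known function.
        usable = [
            (ref_id, evalue)
            for ref_id, evalue in query_hits
            if evalue <= max_evalue and ref_id in reference_functions
        ]
        if usable:
            best_ref, _ = min(usable, key=lambda hit: hit[1])
            annotations[query_id] = reference_functions[best_ref]
        else:
            # No usable homologue: the protein stays hypothetical.
            annotations[query_id] = None
    return annotations


def annotation_coverage(annotations: Dict[str, Optional[str]]) -> float:
    """Fraction of proteins in a proteome that received a functional annotation."""
    if not annotations:
        return 0.0
    annotated = sum(1 for function in annotations.values() if function is not None)
    return annotated / len(annotations)


# Toy proteome of four proteins (all identifiers are made up).
example_hits = {
    "q1": [("refA", 1e-50), ("refB", 1e-10)],  # strong hit to an annotated protein
    "q2": [("refC", 0.5)],                     # hit too weak to trust
    "q3": [],                                  # no homologue found
    "q4": [("refD", 1e-20)],                   # homologue is itself unannotated
}
example_reference = {
    "refA": "DNA polymerase III subunit alpha",
    "refB": "DNA polymerase I",
    "refC": "ABC transporter permease",
}

example_annotations = transfer_annotations(example_hits, example_reference)
example_coverage = annotation_coverage(example_annotations)  # 1 of 4 proteins
```

In this toy case only `q1` is annotated. The `q4` case shows how a poorly characterized reference set leaves a query unannotated even when a clear homologue exists, which is the kind of bias described in the paragraph above.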
