Abstract
Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.
Highlights
One of the most important steps after sequencing and assembling a microbial genome is the annotation of its protein-coding genes
Many hypothetical gene predictions likely represent true protein-coding sequence, but it is not known how many of them represent false positives
It is difficult if not impossible to prove that a predicted open reading frame is not a gene; these hypothetical proteins have remained in genome annotation databases for many years
Summary
One of the most important steps after sequencing and assembling a microbial genome is the annotation of its protein-coding genes. Widely used prokaryotic gene finders include various iterations of Glimmer [1, 2], GeneMark [3, 4], and Prodigal [5], all of which are based on Markov models and utilize an array of biologically inspired heuristics. Each of these previous methods requires a bootstrapping step to train its internal gene model on each new genome. All current software tools predict hundreds or thousands of “extra” genes per genome, i.e., genes that do not match any gene with a known function and are usually given the name “hypothetical protein.” Many of these hypothetical genes likely represent genuine protein-coding sequences, but many others may be false positive predictions. Systematically annotating false positives as genes may create problems for downstream analyses of genome function [7].
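To make the notion of a candidate gene concrete, the following is a minimal sketch of a naive open reading frame (ORF) scan on the forward strand only. It is not Balrog's algorithm or that of any tool cited above; the function name `find_orfs`, the `min_codons` threshold, and the choice of start codons are assumptions made for illustration. A scan like this typically produces far more candidates than there are real genes, which is why a gene model is needed to score them.

```python
# Illustrative sketch only: a naive forward-strand ORF scan.
# This is NOT Balrog's method; names and thresholds are hypothetical.

START_CODONS = {"ATG", "GTG", "TTG"}  # common prokaryotic start codons (assumed set)
STOP_CODONS = {"TAA", "TAG", "TGA"}   # standard stop codons


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) half-open intervals of forward-strand ORFs.

    In each of the three reading frames, an ORF runs from the first start
    codon seen after the previous stop codon through the next in-frame
    stop codon (inclusive). ORFs shorter than min_codons are discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the current candidate start codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # reset and look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    example = "CCATGAAATTTTAGCC"
    for begin, end in find_orfs(example):
        print(begin, end, example[begin:end])
```

A real gene finder would also scan the reverse complement, consider alternative start sites within each ORF, and score each candidate with a learned model rather than accept every ORF above a length cutoff.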