Balrog: A universal protein model for prokaryotic gene prediction.

Markus J Sommer, Steven L Salzberg

doi:10.1371/journal.pcbi.1008727

Open Access

https://doi.org/10.1371/journal.pcbi.1008727

Journal: PLOS Computational Biology
Publication Date: Feb 26, 2021
Citations: 27
License type: CC BY 4.0

Affiliation: Johns Hopkins University

Abstract

Low-cost, high-throughput sequencing has led to an enormous increase in the number of sequenced microbial genomes, with well over 100,000 genomes in public archives today. Automatic genome annotation tools are integral to understanding these organisms, yet older gene finding methods must be retrained on each new genome. We have developed a universal model of prokaryotic genes by fitting a temporal convolutional network to amino-acid sequences from a large, diverse set of microbial genomes. We incorporated the new model into a gene finding system, Balrog (Bacterial Annotation by Learned Representation Of Genes), which does not require genome-specific training and which matches or outperforms other state-of-the-art gene finding tools. Balrog is freely available under the MIT license at https://github.com/salzberg-lab/Balrog.

Highlights

One of the most important steps after sequencing and assembling a microbial genome is the annotation of its protein-coding genes.
Many hypothetical gene predictions likely represent true protein-coding sequence, but it is not known how many of them represent false positives.
It is difficult if not impossible to prove that a predicted open reading frame is not a gene; these hypothetical proteins have remained in genome annotation databases for many years.

Summary

Introduction

One of the most important steps after sequencing and assembling a microbial genome is the annotation of its protein-coding genes. Widely used prokaryotic gene finders include various iterations of Glimmer [1, 2], GeneMark [3, 4], and Prodigal [5], all of which are based on Markov models and utilize an array of biologically inspired heuristics. Each of these previous methods requires a bootstrapping step to train its internal gene model on each new genome. All current software tools predict hundreds or thousands of “extra” genes per genome, i.e., genes that do not match any gene with a known function and are usually given the name “hypothetical protein.” Many of these hypothetical genes likely represent genuine protein-coding sequences, but many others may be false positive predictions. Systematically annotating false positives as genes may create problems for downstream analyses of genome function [7].
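
To make the notion of a candidate open reading frame concrete, the following is a minimal Python sketch that scans the three forward frames of a DNA string and translates each candidate into an amino-acid sequence, the kind of input a protein-level gene model would score. It is not Balrog's implementation. The function names `translate` and `find_orfs`, the `min_codons` threshold, and the restriction to ATG starts on the forward strand are all illustrative assumptions; real prokaryotic gene finders also consider alternative start codons such as GTG and TTG and scan the reverse strand.

```python
# Illustrative sketch only: a naive forward-strand ORF scan, not Balrog's method.
from typing import List, Tuple

# Standard genetic code, with codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO)
)

START_CODONS = {"ATG"}  # assumption: ATG only; bacteria also use GTG and TTG
STOP_CODONS = {"TAA", "TAG", "TGA"}


def translate(dna: str) -> str:
    """Translate DNA codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna: str, min_codons: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, protein) for each forward-strand ORF.

    start is the 0-based index of the first base of the start codon, end is
    the index just past the stop codon, and protein excludes the stop.
    min_codons is a hypothetical length cutoff counted without the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # earliest start codon seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif codon in STOP_CODONS:
                if start is not None and (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, translate(dna[start:i])))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))  # [(2, 17, 'MKFG')]
```

A scan like this proposes every ORF above the length cutoff, which is why a separate scoring model is needed to decide which candidates are likely to be real genes.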
