A modified GC-specific MAKER gene annotation method reveals improved and novel gene predictions of high and low GC content in Oryza sativa

Megan J Bowman, Jane A Pulman, Kevin L Childs, Tiffany L Liu

doi:10.1186/s12859-017-1942-z

Abstract

Background

Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. When gene prediction programs are trained on a subset of grass genes with random GC content, they are effectively being trained on two classes of genes at once, and this can be expected to give poor results when genes are predicted in new genome sequences.

Results

We find that gene prediction programs trained on grass genes with random GC content do not completely predict all grass genes with extreme GC content. We show that gene prediction programs that are trained with grass genes with high or low GC content can make both better and unique gene predictions compared to gene prediction programs that are trained on genes with random GC content. By separately training gene prediction programs with genes from multiple GC ranges and using the programs within the MAKER genome annotation pipeline, we were able to improve the annotation of the Oryza sativa genome compared to using the standard MAKER annotation protocol. Gene structure was improved in over 13% of genes, and 651 novel genes were predicted by the GC-specific MAKER protocol.

Conclusions

We present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa. We expect that this protocol will also be beneficial for gene prediction in any organism with bimodal or other unusual gene GC content.

Highlights

Accurate structural annotation depends on well-trained gene prediction programs
Reannotation of the O. sativa genome with MAKER using Hidden Markov Models (HMMs) trained on genes with high and low GC content
We hypothesized that gene prediction programs trained on genes with specific GC content could both find different genes and produce different gene models at identical loci compared with prediction programs trained on genes with random GC content
SNAP and AUGUSTUS HMMs were trained with training genes picked randomly without regard for GC content, with training genes of low GC content, or with training genes of high GC content
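
The three training sets described in the highlights (random, low GC, high GC) can be illustrated with a short Python sketch. The function names, the toy sequences, and the cutoffs `low_max` and `high_min` are assumptions chosen for illustration only; the text above does not state the GC thresholds or the selection procedure that the authors used.

```python
import random


def gc_content(seq: str) -> float:
    """Return the fraction of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Ambiguous bases such as N are ignored; an empty count yields 0.0.
    return gc / total if total else 0.0


def split_training_sets(genes, low_max=0.50, high_min=0.65, n_random=2, seed=0):
    """Partition genes (name -> coding sequence) into three training sets.

    low_max and high_min are hypothetical cutoffs, not values from the paper.
    """
    low = sorted(n for n, s in genes.items() if gc_content(s) <= low_max)
    high = sorted(n for n, s in genes.items() if gc_content(s) >= high_min)
    # The "random" set is drawn without regard for GC content.
    rng = random.Random(seed)
    names = sorted(genes)
    rand = sorted(rng.sample(names, min(n_random, len(names))))
    return {"low": low, "high": high, "random": rand}


# Toy sequences, invented for this example.
genes = {
    "geneA": "ATGAAATTTAAATAA",  # AT-rich
    "geneB": "ATGGCCGCGGGCTGA",  # GC-rich
    "geneC": "ATGGCATTTGCCTAA",  # intermediate
}
training_sets = split_training_sets(genes)
print(training_sets["low"], training_sets["high"], training_sets["random"])
```

Each set would then be passed to the usual training procedure of a gene predictor such as SNAP or AUGUSTUS, giving one set of trained parameters per GC class.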

Summary

Introduction

Most widely used gene prediction programs depend on Hidden Markov Models (HMMs) to predict gene structure within genomic sequence [1,2,3]. Genes are modeled within HMMs using a series of hidden states that represent generic gene structure. The bimodal distribution of GC content in the grasses suggests that there exist two classes of genes (high GC and low GC) that the gene prediction programs are attempting to learn.
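
To make the idea of hidden states concrete, the following Python sketch decodes a DNA string with a two-state HMM using the Viterbi algorithm. It is a deliberately reduced toy, not the model used by SNAP, AUGUSTUS, or MAKER: real gene predictors use many more states (exons, introns, splice sites, and so on), and every probability below is an invented assumption in which the "coding" state is GC-rich.

```python
import math

# Toy two-state model; all probabilities are illustrative assumptions.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq (A/C/G/T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # Log-probability of the best path ending in each state at position 0.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


path = viterbi("AAAAAAAAGGGGGGGG")
print(path)
```

With emission probabilities like these, an AT-rich gene would tend to be labelled noncoding. This is the kind of mismatch that could arise when a single set of parameters has to cover two GC classes.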



Journal: BMC Bioinformatics
Publication Date: Nov 25, 2017
Citations: 19
License type: open-access
