Abstract

BackgroundAccurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. When gene prediction programs are trained on a subset of grass genes with random GC content, they are effectively being trained on two classes of genes at once, and this can be expected to result in poor results when genes are predicted in new genome sequences.ResultsWe find that gene prediction programs trained on grass genes with random GC content do not completely predict all grass genes with extreme GC content. We show that gene prediction programs that are trained with grass genes with high or low GC content can make both better and unique gene predictions compared to gene prediction programs that are trained on genes with random GC content. By separately training gene prediction programs with genes from multiple GC ranges and using the programs within the MAKER genome annotation pipeline, we were able to improve the annotation of the Oryza sativa genome compared to using the standard MAKER annotation protocol. Gene structure was improved in over 13% of genes, and 651 novel genes were predicted by the GC-specific MAKER protocol.ConclusionsWe present a new GC-specific MAKER annotation protocol to predict new and improved gene models and assess the biological significance of this method in Oryza sativa. We expect that this protocol will also be beneficial for gene prediction in any organism with bimodal or other unusual gene GC content.

Highlights

  • Accurate structural annotation depends on well-trained gene prediction programs

  • Reannotation of the O. sativa genome with MAKER using Hidden Markov Model (HMM) trained on high and low GC content We thought that grass genes identified by gene prediction programs that are trained on genes with specific GC content could both find different genes and produce differing gene models at identical loci than prediction programs that are trained on genes with random GC content

  • SNAP and AUGUSTUS HMMS were trained either with training genes randomly picked without regard for GC content, with training genes with low GC content or with training genes with high GC content

Read more

Summary

Introduction

Accurate structural annotation depends on well-trained gene prediction programs. Training data for gene prediction programs are often chosen randomly from a subset of high-quality genes that ideally represent the variation found within a genome. One aspect of gene variation is GC content, which differs across species and is bimodal in grass genomes. Most widely used gene prediction programs depend on Hidden Markov Models (HMMs) to predict gene structure within genomic sequence [1,2,3]. Genes are modeled within HMMs using a series of hidden states that represent generic gene structure. The bimodal distribution of GC-content in the grasses suggests that there exist two classes of genes (high GC and low GC) that the gene prediction programs are attempting to learn

Methods
Results
Discussion
Conclusion
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call