Abstract

Background: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction.

Results: Here, we present an extension of the gene prediction program GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions.

Conclusions: GeMoMa might be of great utility for annotating newly sequenced genomes but also for finding homologs of a specific gene or gene family. GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/GeMoMa.

Highlights

  • Genome annotation is of key importance in many research questions

  • Genome annotation pipelines utilize three main sources of information, namely evidence from wet-lab transcriptome studies [1, 2], ab-initio gene prediction based on general features of genes [3, 4], and homology-based gene prediction relying on gene models of related, well-annotated species [5,6,7]

  • We find that the performance of CodingQuarry is highly sensitive to RNA-seq processing, whereas the performance of Gene Model Mapper (GeMoMa) is barely affected (Additional file 1: Table S5)


Introduction

The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. The annotation of protein-coding genes is of critical importance for many fields of biological research including, for instance, comparative genomics, functional proteomics, gene targeting, genome editing, phylogenetics, transcriptomics, and phylostratigraphy. Genome annotation pipelines utilize three main sources of information, namely evidence from wet-lab transcriptome studies [1, 2], ab-initio gene prediction based on general features of (protein-coding) genes [3, 4], and homology-based gene prediction relying on gene models of (closely) related, well-annotated species [5,6,7]. Experimental data allow for inferring the coverage of gene predictions and the splice sites bordering their exons, which may assist computational ab-initio or homology-based approaches. RNA-seq data are available for a wide range of organisms, tissues and environmental conditions, and can be utilized in genome annotation pipelines.
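To make the idea of splice-site evidence concrete, the following minimal sketch computes the fraction of introns in a predicted gene model that are confirmed by RNA-seq splice junctions. It is an illustration only and not GeMoMa's implementation: the function names `introns_from_exons` and `intron_support`, the 1-based inclusive coordinate convention, and the example coordinates are all assumptions made for this example.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical convention for this sketch: 1-based, inclusive (start, end).
Interval = Tuple[int, int]


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the introns lying between consecutive exons of one transcript."""
    ordered = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def intron_support(exons: List[Interval], junctions: Set[Interval]) -> Optional[float]:
    """Fraction of predicted introns that exactly match an RNA-seq junction.

    Returns None for single-exon predictions, which have no introns to check.
    """
    introns = introns_from_exons(exons)
    if not introns:
        return None
    supported = sum(1 for intron in introns if intron in junctions)
    return supported / len(introns)


# Illustrative (made-up) coordinates: three predicted exons, two observed junctions.
predicted_exons = [(100, 200), (301, 400), (551, 700)]
rnaseq_junctions = {(201, 300), (401, 560)}

support = intron_support(predicted_exons, rnaseq_junctions)
print(support)
```

In this made-up example the first predicted intron matches an observed junction exactly, while the second differs at its 3' boundary, so only one of the two introns counts as supported. A score of this kind could be used to rank or filter competing predictions, although how any particular tool weighs such evidence is not specified here.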
