A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms

Nicolas Scalzitti, Julie D Thompson, Olivier Poch, Pierre Collet, Anne Jeannin-Girardon

doi:10.1186/s12864-020-6707-9

Open Access

https://doi.org/10.1186/s12864-020-6707-9

Abstract

Background: The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. New benchmark methods are needed to evaluate the accuracy of gene prediction methods in the face of incomplete genome assemblies, low genome coverage and quality, complex gene structures, or a lack of suitable sequences for evidence-based annotations.

Results: We describe the construction of a new benchmark, called G3PO (benchmark for Gene and Protein Prediction PrOgrams), designed to represent many of the typical challenges faced by current genome annotation projects. The benchmark is based on a carefully validated and curated set of real eukaryotic genes from 147 phylogenetically disperse organisms, and a number of test sets are defined to evaluate the effects of different features, including genome sequence quality, gene structure complexity, protein length, etc. We used the benchmark to perform an independent comparative analysis of the most widely used ab initio gene prediction programs and identified the main strengths and weaknesses of the programs. More importantly, we highlight a number of features that could be exploited in order to improve the accuracy of current prediction tools.

Conclusions: The experiments showed that ab initio gene structure prediction is a very challenging task, which should be further investigated. We believe that the baseline results associated with the complex gene test sets in G3PO provide useful guidelines for future studies.

Highlights

The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models
The G3PO benchmark contains 1793 proteins from a diverse set of organisms (Additional file 1: Table S1), which can be used for the evaluation of gene prediction programs
The proteins were extracted from the Uniprot [34] database, and are divided into 20 orthologous families that are representative of complex proteins, with multiple functional domains, repeats and low complexity regions (Additional file 1: Table S2)

Summary

Introduction

The draft genome assemblies produced by new sequencing technologies present important challenges for automatic gene prediction pipelines, leading to less accurate gene models. Information from closely related genomes can be exploited in order to transfer known gene models to the target genome. Several methods incorporate similarity information, either from transcriptome data or known gene models, including GenomeScan [8], GeneWise [9], FGENESH [10], Augustus [11], Splign [12], CodingQuarry [13], and LoReAN [14]. The main limitation of similarity-based approaches arises when transcriptome sequences or closely related genomes are not available. Such approaches also encourage the propagation of erroneous annotations across genomes and cannot be used to discover novelty [5]. Ab initio gene predictors, such as Genscan [16], GlimmerHMM [17], GeneID [18], FGENESH [10], Snap [19], Augustus [20], and GeneMark-ES [21], can be used to identify previously unknown genes or genes that have evolved beyond the limits of similarity-based approaches.

Journal: BMC Genomics
Publication Date: Apr 9, 2020
Citations: 64
License type: open-access
