Seqping: gene prediction pipeline for plant genomes using self-training gene models and transcriptomic data

Kuang-Lim Chan, Mohd Firdaus-Raih, Tatiana V Tatarinova, Eng-Ti Leslie Low, Michael Hogan, Rozana Rosli

doi:10.1186/s12859-016-1426-6

Abstract

Background: Gene prediction is one of the most important steps in the genome annotation process. A large number of software tools and pipelines developed using various computing techniques are available for gene prediction. However, these systems have yet to accurately predict all or even most of the protein-coding regions. Furthermore, none of the currently available gene-finders has a universal Hidden Markov Model (HMM) that can perform gene prediction for all organisms equally well in an automatic fashion.

Results: We present an automated gene prediction pipeline, Seqping, which uses self-training HMMs and transcriptomic data. The pipeline processes the genome and transcriptome sequences of the target species using the GlimmerHMM, SNAP, and AUGUSTUS pipelines, followed by the MAKER2 program, which combines the predictions from the three tools with the transcriptomic evidence. Seqping generates species-specific HMMs that are able to offer unbiased gene predictions. The pipeline was evaluated using the Oryza sativa and Arabidopsis thaliana genomes. Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis showed that the pipeline was able to identify at least 95% of BUSCO’s plantae dataset. Our evaluation shows that Seqping generated better gene predictions than three HMM-based programs (MAKER2, GlimmerHMM and AUGUSTUS) run with their respective available HMMs. Seqping had the highest accuracy in rice (0.5648 for CDS, 0.4468 for exon, and 0.6695 for nucleotide structure) and A. thaliana (0.5808 for CDS, 0.5955 for exon, and 0.8839 for nucleotide structure).

Conclusions: Seqping provides researchers with a seamless pipeline to train species-specific HMMs and predict genes in newly sequenced or less-studied genomes. We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available HMMs.
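The accuracy figures quoted above are reported per feature type (CDS, exon, nucleotide), but the abstract does not say how they were computed. As a rough illustration only, the sketch below scores predicted exons against a reference annotation using exact-match sensitivity and specificity, a common convention in gene-prediction benchmarking. The `Exon` type, the `exon_level_scores` function, and the use of the mean of the two values as a single score are assumptions made for this example, not the authors' method.

```python
from typing import Dict, NamedTuple, Set


class Exon(NamedTuple):
    """Hypothetical exon record: sequence id, 1-based start/end, strand."""
    seqid: str
    start: int
    end: int
    strand: str


def exon_level_scores(predicted: Set[Exon], reference: Set[Exon]) -> Dict[str, float]:
    """Score predictions by exact coordinate match (an assumed convention).

    sensitivity = matched exons / reference exons
    specificity = matched exons / predicted exons
    """
    true_pos = len(predicted & reference)  # exons with identical coordinates
    sensitivity = true_pos / len(reference) if reference else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Averaging the two is an illustrative choice, not taken from the paper.
        "mean": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    reference = {
        Exon("chr1", 100, 200, "+"),
        Exon("chr1", 300, 400, "+"),
        Exon("chr1", 500, 600, "+"),
        Exon("chr1", 700, 800, "+"),
    }
    predicted = {
        Exon("chr1", 100, 200, "+"),
        Exon("chr1", 300, 400, "+"),
        Exon("chr1", 500, 590, "+"),  # wrong end coordinate, so not a match
    }
    print(exon_level_scores(predicted, reference))
```

Exact matching is strict: an exon that is off by a single base counts as both a missed reference exon and a false prediction, which is one reason exon-level scores tend to be lower than nucleotide-level ones.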

Highlights

Gene prediction is one of the most important steps in the genome annotation process.
We conclude that the Seqping pipeline predictions are more accurate than gene predictions using the other three approaches with the default or available Hidden Markov Models (HMMs).
The three main gene finders (GlimmerHMM, AUGUSTUS, and SNAP) ship with pre-built HMMs for several model species in their software packages, but these existing HMMs may not be suitable for highly complex plant genomes.

Summary

Introduction

A large number of software tools and pipelines developed using various computing techniques are available for gene prediction. These systems have yet to accurately predict all or even most of the protein-coding regions. Rapid and cost-effective next-generation sequencing (NGS) technologies produce large volumes of DNA sequencing data in large-scale genome projects. These advances enabled the research community to sequence. Gene finders are often trained using known gene models, and this leads to biases in gene structure [12,13,14]. None of these systems incorporates a flexible, universal gene model that can perform gene prediction for a wide range of species. Available gene finders do not accurately predict most of the protein-coding regions [15], and predicting the complete set of an organism’s protein-coding genes remains a significant challenge.

Journal: BMC Bioinformatics
Publication Date: Jan 1, 2017
Citations: 27
License type: open-access
