Abstract

Summary: Genomes of emerging model organisms are now being sequenced at very low cost. However, obtaining accurate gene predictions remains challenging: even the best gene prediction algorithms make substantial errors that can jeopardize subsequent analyses. Therefore, many predicted genes must undergo time-consuming visual inspection and manual curation. We developed GeneValidator (GV) to automatically identify problematic gene predictions and to aid manual curation. For each gene, GV performs multiple analyses based on comparisons to gene sequences from large databases. The resulting report identifies problematic gene predictions and includes extensive statistics and graphs for each prediction to guide manual curation efforts. GV thus accelerates and enhances the work of biocurators and researchers who need accurate gene predictions from newly sequenced genomes.

Availability and implementation: GV can be used through a web interface or from the command line. GV is open-source (AGPL), available at https://wurmlab.github.io/tools/genevalidator.

Contact: y.wurm@qmul.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Introduction

The plummeting costs of DNA sequencing (Wetterstrand, 2015) have made de novo genome sequencing accessible to individual laboratories and even researchers (Nygaard and Wurm, 2015). Relevant sources include protein-coding sequences from other organisms, or peptide or transcriptome sequences from the organism being studied. Modern algorithms combine both approaches (Cantarel et al., 2008; Korf, 2004; Stanke et al., 2008). Automated gene prediction quality evaluation tools analyze exon boundaries (Eilbeck et al., 2009; Yandell and Ence, 2012) or focus on subsets of highly conserved genes (Parra et al., 2007). Such tools ignore most of the information present in frequently updated databases such as SwissProt or GenBank NR. GV provides quality evaluations in text formats for automated analysis and in highly visual formats for inspection by researchers.
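To make the idea of a comparison-based quality check concrete, the following minimal Python sketch flags a predicted protein whose length deviates strongly from the lengths of similar database sequences. It is an illustration only and is not taken from GV: the function name `flag_by_length`, the 25% tolerance and the example hit lengths are all assumptions chosen for this example.

```python
from statistics import median


def flag_by_length(predicted_len, hit_lens, tolerance=0.25):
    """Compare a predicted protein length with lengths of similar sequences.

    Hypothetical helper, not part of GeneValidator.
    Returns (relative_deviation, flagged), where relative_deviation is the
    absolute difference from the median hit length divided by that median.
    """
    if not hit_lens:
        raise ValueError("at least one database hit is required")
    typical = median(hit_lens)
    deviation = abs(predicted_len - typical) / typical
    # A large deviation may indicate a truncated, merged or split gene model.
    return deviation, deviation > tolerance


# Illustrative values only: lengths (in amino acids) of hypothetical hits.
hits = [298, 300, 305, 310, 320]
for name, length in [("gene_a", 305), ("gene_b", 150), ("gene_c", 640)]:
    deviation, flagged = flag_by_length(length, hits)
    status = "inspect manually" if flagged else "looks consistent"
    print(f"{name}: length={length}, deviation={deviation:.2f}, {status}")
```

In this sketch a prediction is only flagged for inspection, not rejected, in keeping with the goal of guiding manual curation rather than replacing it.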
