Abstract
Summary: Genomes of emerging model organisms are now being sequenced at very low cost. However, obtaining accurate gene predictions remains challenging: even the best gene prediction algorithms make substantial errors and can jeopardize subsequent analyses. Therefore, many predicted genes must undergo time-consuming visual inspection and manual curation. We developed GeneValidator (GV) to automatically identify problematic gene predictions and to aid manual curation. For each gene, GV performs multiple analyses based on comparisons to gene sequences from large databases. The resulting report identifies problematic gene predictions and includes extensive statistics and graphs for each prediction to guide manual curation efforts. GV thus accelerates and enhances the work of biocurators and researchers who need accurate gene predictions from newly sequenced genomes.
Availability and implementation: GV can be used through a web interface or from the command line. GV is open-source (AGPL), available at https://wurmlab.github.io/tools/genevalidator.
Contact: y.wurm@qmul.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
Highlights
The plummeting costs of DNA sequencing (Wetterstrand, 2015) have made de novo genome sequencing accessible to individual laboratories and even researchers (Nygaard and Wurm, 2015)
Genomes of emerging model organisms are being sequenced at very low cost
We developed GeneValidator (GV) to automatically identify problematic gene predictions and to aid manual curation
Summary
The plummeting costs of DNA sequencing (Wetterstrand, 2015) have made de novo genome sequencing accessible to individual laboratories and even researchers (Nygaard and Wurm, 2015). Gene prediction has traditionally relied either on ab initio models or on evidence from relevant sources, which include protein-coding sequences from other organisms, or peptide or transcriptome sequences from the organism being studied. Modern algorithms combine both approaches (Cantarel et al., 2008; Korf, 2004; Stanke et al., 2008). Existing automated tools for evaluating gene prediction quality analyze exon boundaries (Eilbeck et al., 2009; Yandell and Ence, 2012) or focus on subsets of highly conserved genes (Parra et al., 2007). Such tools ignore most of the information present in frequently updated databases such as SwissProt or GenBank NR. GV provides quality evaluations in text formats for automated analysis and in highly visual formats for inspection by researchers.