ORFcor: Identifying and Accommodating ORF Prediction Inconsistencies for Phylogenetic Analysis

Jonathan L Klassen, Cameron R Currie, Jonathan H Badger

doi:10.1371/journal.pone.0058387

Open Access

https://doi.org/10.1371/journal.pone.0058387

Abstract

The high-throughput annotation of open reading frames (ORFs) required by modern genome sequencing projects necessitates computational protocols that sometimes annotate orthologous ORFs inconsistently. Such inconsistencies hinder comparative analyses by non-uniformly extending or truncating 5′ and/or 3′ sequence ends, causing ORFs that are in fact identical to artificially diverge. Whereas strategies exist to correct such inconsistencies during whole-genome annotation, equivalent software designed to correct subsets of these data without genome reannotation is lacking. We therefore developed ORFcor, which corrects annotation inconsistencies using consensus start and stop positions derived from sets of closely related orthologs. ORFcor corrects inconsistent ORF annotations in diverse test datasets with specificities and sensitivities approaching 100% when sufficiently related orthologs (e.g., from the same taxonomic family) are available for comparison. The ORFcor package is implemented in Perl, multithreaded to handle large datasets, includes related scripts to facilitate high-throughput phylogenomic analyses, and is freely available at www.currielab.wisc.edu/downloads.html.
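The consensus idea described above can be illustrated with a minimal Python sketch. This is not ORFcor's actual implementation (which is written in Perl); it assumes the orthologs have already been aligned, takes the most common first and last non-gap column as the consensus start and stop, and flags sequences that deviate by at least a hypothetical `min_diff` threshold. The function names, the threshold, and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def span(aligned):
    """Return (first, last) column indices holding a residue (non-gap).

    Assumes the aligned sequence contains at least one residue.
    """
    cols = [i for i, ch in enumerate(aligned) if ch != "-"]
    return cols[0], cols[-1]


def flag_inconsistent(alignment, min_diff=3):
    """Flag sequences whose ends deviate from the consensus start/stop.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    min_diff:  hypothetical threshold (in alignment columns) for reporting.
    Returns a dict mapping name -> list of detected inconsistencies.
    """
    spans = {name: span(seq) for name, seq in alignment.items()}
    # Consensus = most frequent start and stop column among the orthologs.
    cons_start = Counter(s for s, _ in spans.values()).most_common(1)[0][0]
    cons_end = Counter(e for _, e in spans.values()).most_common(1)[0][0]

    issues = {}
    for name, (start, end) in spans.items():
        found = []
        if cons_start - start >= min_diff:
            found.append("5' extension")
        if start - cons_start >= min_diff:
            found.append("5' truncation")
        if end - cons_end >= min_diff:
            found.append("3' extension")
        if cons_end - end >= min_diff:
            found.append("3' truncation")
        if found:
            issues[name] = found
    return issues


# Toy alignment (invented data): three of five orthologs disagree with the rest.
alignment = {
    "orthA": "----MKTLLVAG---",
    "orthB": "----MKTLLVAG---",
    "orthC": "MSRRMKTLLVAG---",
    "orthD": "--------LVAG---",
    "orthE": "----MKTLLVAGQRS",
}

for name, found in flag_inconsistent(alignment).items():
    print(name, found)
```

In this toy case the consensus start and stop come from the two ORFs that agree, so the approach only works when most orthologs are annotated consistently; this is in line with the abstract's point that sufficiently related orthologs must be available for comparison.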

Highlights

Recent technical advances have promoted the proliferation of genome sequencing projects, leading to the accumulation of extensive genome-scale sequence data in public databases.
A wide range of algorithm parameters were examined on data having different open reading frame (ORF) annotation inconsistency frequencies and sizes, and representative results are shown in Table 1.
This rate of ORF annotation inconsistencies is likely an underestimate given the slow evolutionary rate of many of the proteins used in our dataset, and would be higher still if draft-quality genomes were included.

Summary

Introduction

Recent technical advances have promoted the proliferation of genome sequencing projects, leading to the accumulation of extensive genome-scale sequence data in public databases. This has in turn facilitated routine, large-scale comparative analyses of functional and taxonomic diversity, i.e., "phylogenomics" [1,2]. Most phylogenomic approaches require comparisons between genes or proteins, e.g., to determine homology, identify orthologous, paralogous and xenologous relationships, and conduct phylogenetic analysis. Such analyses assume that their input data are directly comparable, i.e., a sequence that is truly 100% identical in two genomes will exist in exactly identical copies in each genome. Whereas orthologous ORFs may truly differ in structure (e.g., due to multiple unique start or stop sites, programmed frameshifts, or pseudogenization [6]), differentiating such genuine variation from sequencing or annotation errors is difficult without experimental validation. The result of any of these inconsistencies is that two truly identical sequences will artificially differ due to ORF truncation, extension, and/or the incorrect incorporation of sequence not belonging to that ORF, thereby potentially confounding further analysis.
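The effect of an annotation inconsistency on apparent sequence similarity can be made concrete with a small Python sketch. It assumes one common convention for pairwise identity, in which a column gapped in only one sequence counts as a mismatch; other conventions exist, and the function name and toy sequences are invented for illustration.

```python
def percent_identity(a, b):
    """Percent identity between two aligned sequences of equal length.

    Assumed convention: columns gapped in both sequences are skipped, and a
    column gapped in only one sequence counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# The same protein, annotated once in full and once with a truncated 5' end.
full = "MSRRMKTLLVAG"
truncated = "----MKTLLVAG"

print(percent_identity(full, full))
print(round(percent_identity(full, truncated), 1))
```

Under this convention, the two annotations of one underlying sequence score well below 100% identity purely because four residues are missing from one ORF call, which is the kind of artificial divergence described above.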

Methods

Results

Conclusion