The impact of gene duplication, insertion, deletion, lateral gene transfer and sequencing error on orthology inference: a simulation study.

Daniel A Dalquen, Gaston H Gonnet, Christophe Dessimoz, Adrian M Altenhoff, Liran Carmel

doi:10.1371/journal.pone.0056925

Open Access

Abstract

The identification of orthologous genes, a prerequisite for numerous analyses in comparative and functional genomics, is commonly performed computationally from protein sequences. Several previous studies have compared the accuracy of orthology inference methods, but simulated data has not typically been considered in cross-method assessment studies. Yet, while dependent on model assumptions, simulation-based benchmarking offers unique advantages: contrary to empirical data, all aspects of simulated data are known with certainty. Furthermore, the flexibility of simulation makes it possible to investigate performance factors in isolation from one another.

Here, we use simulated data to dissect the performance of six methods for orthology inference available as standalone software packages (Inparanoid, OMA, OrthoInspector, OrthoMCL, QuartetS, SPIMAP) as well as two generic approaches (bidirectional best hit and reciprocal smallest distance). We investigate the impact of various evolutionary forces (gene duplication, insertion, deletion, and lateral gene transfer) and technological artefacts (ambiguous sequences) on orthology inference. We show that while gene duplication/loss and insertion/deletion are well handled by most methods (albeit with different trade-offs between precision and recall), lateral gene transfer disrupts all methods. As for ambiguous sequences, which might result from poor sequencing, assembly, or genome annotation, we show that they affect alignment score-based orthology methods more strongly than their distance-based counterparts.

Highlights

Two genes occurring in different species are called orthologous if they evolved from a single gene in the last common ancestor, whereas paralogous genes arise by gene duplication [1].
A variety of methods for orthology inference has been developed over the last decade [6,7,8,9,10], but validation of these methods is inherently difficult for the same reason that led to their development: the precise evolutionary history of almost all sequence data observed today is unknown.
We computed orthology based on simple bidirectional best hits (BBH) and reciprocal smallest distance (RSD).
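
To make the bidirectional best hit idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions and not the pipeline used in the study: the score dictionary, the gene names, and the helper functions `best_hits`, `bidirectional_best_hits`, and `precision_recall` are all hypothetical. It assumes that pairwise similarity scores between the genes of two genomes are already available, calls a pair orthologous when each gene is the other's top-scoring hit, and then compares the prediction with a known set of true orthologs, as a simulation makes possible.

```python
from typing import Dict, Set, Tuple

# Hypothetical input: similarity score for each (gene in genome A, gene in genome B) pair.
Scores = Dict[Tuple[str, str], float]
Pairs = Set[Tuple[str, str]]


def best_hits(scores: Scores, reverse: bool = False) -> Dict[str, str]:
    """Map each query gene to its highest-scoring hit in the other genome.

    With reverse=False the genes of genome A are the queries;
    with reverse=True the genes of genome B are the queries.
    Ties are resolved in favour of the first pair encountered.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for (a, b), score in scores.items():
        query, target = (b, a) if reverse else (a, b)
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(scores: Scores) -> Pairs:
    """Return the (a, b) pairs in which a and b are each other's best hit."""
    forward = best_hits(scores)
    backward = best_hits(scores, reverse=True)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}


def precision_recall(predicted: Pairs, truth: Pairs) -> Tuple[float, float]:
    """Precision and recall of predicted ortholog pairs against the known truth."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy scores (invented for illustration only).
    toy_scores = {
        ("a1", "b1"): 90.0,
        ("a1", "b2"): 60.0,
        ("a2", "b1"): 70.0,
        ("a2", "b2"): 40.0,
    }
    toy_truth = {("a1", "b1"), ("a2", "b2")}
    predicted = bidirectional_best_hits(toy_scores)
    print(predicted, precision_recall(predicted, toy_truth))
```

In this toy case only the pair (a1, b1) is reciprocal, so the prediction is precise but misses the true pair (a2, b2). This kind of precision and recall trade-off can be measured exactly only when the true evolutionary history is known, which is the case for simulated data.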

Summary

Introduction

Two genes occurring in different species are called orthologous if they evolved from a single gene in the last common ancestor, whereas paralogous genes arise by gene duplication [1]. Some validation attempts have used conservation of functional aspects, such as gene expression, protein-protein interaction, or Gene Ontology annotations, as indicators of orthology [11,12]. This approach is open to debate, as orthology is defined solely by the evolutionary history of the genes, and the relation between evolution and function is not straightforward [3]. To address this problem, tests of phylogenetic congruence between orthologs and a reference species tree have been pursued [12]. There has also been interest in the community in defining reference datasets for benchmarking orthology inference methods [14], for instance by using the Yeast Gene Order Browser as a source of highly curated datasets [15], or by building sets of "gold standard" reconciled gene/species trees [16,17].

Methods

Results

Conclusion