The impact of paralogy on phylogenomic studies - a case study on annelid relationships.

Torsten H Struck, Zhanjiang Liu

doi:10.1371/journal.pone.0062892

Abstract

Phylogenomic studies based on hundreds of genes derived from expressed sequence tag (EST) libraries are increasingly used to reveal the phylogeny of taxa. A prerequisite for these studies is the assignment of genes into clusters of orthologous sequences. Sophisticated methods of orthology prediction are used in such analyses, but it is rarely assessed whether paralogous sequences have been erroneously grouped together as orthologous sequences after the prediction, and whether this had an impact on the phylogenetic reconstruction using a supermatrix approach. Herein, I tested the impact of paralogous sequences on the reconstruction of annelid relationships based on phylogenomic datasets. Using single-partition analyses, screening for bootstrap support, BLAST searches and pruning of sequences in the supermatrix, wrongly assigned paralogous sequences were found in eight partitions, and the placement of five taxa (the annelids Owenia, Scoloplos, Sthenelais and Eurythoe and the nemertean Cerebratulus), including the robust bootstrap support, could be attributed to the presence of paralogous sequences in two partitions. Excluding these sequences resulted in a different, more weakly supported placement for these taxa. Moreover, the analyses revealed that paralogous sequences impacted the reconstruction when only a single taxon represented a previously supported higher taxon such as a polychaete family. One possibility for a priori detection of wrongly assigned paralogous sequences could combine 1) a screening of single-partition analyses based on criteria such as nodal support or internal branch length with 2) BLAST searches of suspicious cases as presented herein. Also possible are a posteriori approaches in which support for specific clades is investigated by comparing alternative hypotheses based on differences in per-site likelihoods. Increasing the sizes of EST libraries will also decrease the likelihood of wrongly assigned paralogous sequences, and in the case of orthology prediction methods like HaMStR it is likewise decreased by using more than one reference taxon.
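
The a posteriori idea mentioned in the abstract, comparing alternative hypotheses through differences in per-site likelihoods, can be illustrated with a minimal sketch. It assumes that per-site log-likelihoods for two competing topologies have already been computed by some phylogenetic program; the function names, partition labels and numbers below are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def partition_lnl_differences(site_lnl_a, site_lnl_b, site_partition):
    """Sum per-site log-likelihood differences (A minus B) within each partition.

    Positive totals mean the partition favours hypothesis A, negative totals
    mean it favours hypothesis B.
    """
    if not (len(site_lnl_a) == len(site_lnl_b) == len(site_partition)):
        raise ValueError("all inputs must have one entry per alignment site")
    totals = defaultdict(float)
    for lnl_a, lnl_b, part in zip(site_lnl_a, site_lnl_b, site_partition):
        totals[part] += lnl_a - lnl_b
    return dict(totals)


def rank_partitions(differences):
    """Order partitions from strongest support for A to strongest support for B."""
    return sorted(differences.items(), key=lambda item: item[1], reverse=True)


# Made-up toy values: six alignment sites spread over three partitions.
site_lnl_a = [-2.0, -3.0, -1.5, -4.0, -2.5, -1.0]
site_lnl_b = [-2.5, -3.5, -1.5, -3.0, -2.0, -6.0]
site_partition = ["partition_1", "partition_1",
                  "partition_2", "partition_2", "partition_2",
                  "partition_3"]

diffs = partition_lnl_differences(site_lnl_a, site_lnl_b, site_partition)
for name, delta in rank_partitions(diffs):
    print(f"{name}: delta lnL = {delta:+.2f}")
```

A partition that contributes a disproportionate share of the total difference (here the hypothetical `partition_3`) would be a natural candidate for closer inspection, for example by BLAST searches of its sequences.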

Highlights

Molecular phylogenetics has gone through tremendous changes in the last decade with respect to the amount of data used for phylogenetic reconstructions.
Further bioinformatic processing was conducted as described in Struck et al. [26], including assembly of expressed sequence tag (EST) data into contigs, quality trimming of sequences, orthology prediction using either the human ribosomal proteome for local BLAST searches or the program HaMStR [31], and translation into amino acids using ESTwise [61].
The 79 human ribosomal proteins were retrieved from the Ribosomal Protein Gene Database [62], and BLAST searches against the O. fusiformis EST library were conducted.

Summary

Introduction

Molecular phylogenetics has gone through tremendous changes in the last decade with respect to the amount of data used for phylogenetic reconstructions. The most common approach in phylogenomics is to utilize expressed sequence tag (EST) libraries (e.g., [23,25,26,27,28,29]). This means that the transcriptome of a specimen (or tissues of the specimen) is randomly sequenced. A crucial step in this a posteriori selection process is the determination of orthologous genes across the different libraries of the analysis (e.g., [31]). The sequences of the EST libraries of the different taxa are grouped together into clusters of sequences of the same orthologous gene.
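
The grouping step described above can be pictured with a small sketch. It is not the algorithm of HaMStR or of any other orthology prediction program. It only assumes that each EST contig has already been assigned to a reference gene with some similarity score, keeps the best-scoring contig per taxon and gene, and records taxa with several candidate contigs for the same gene, since such cases are where a paralog could be picked by mistake. All identifiers and scores are hypothetical.

```python
from collections import defaultdict


def build_clusters(hits):
    """Group (taxon, seq_id, gene, score) hits into per-gene clusters.

    Returns (clusters, ambiguous):
      clusters[gene][taxon] -> id of the best-scoring sequence
      ambiguous[gene]       -> taxa with more than one candidate sequence
    """
    candidates = defaultdict(lambda: defaultdict(list))
    for taxon, seq_id, gene, score in hits:
        candidates[gene][taxon].append((score, seq_id))

    clusters = {}
    ambiguous = {}
    for gene, per_taxon in candidates.items():
        clusters[gene] = {}
        for taxon, scored in per_taxon.items():
            scored.sort(reverse=True)  # highest score first
            clusters[gene][taxon] = scored[0][1]
            if len(scored) > 1:
                ambiguous.setdefault(gene, []).append(taxon)
    return clusters, ambiguous


# Made-up hits: (taxon, contig id, reference gene, similarity score).
hits = [
    ("taxon_A", "contig_12", "gene_1", 310.0),
    ("taxon_A", "contig_47", "gene_1", 120.0),  # second candidate, possible paralog
    ("taxon_B", "contig_03", "gene_1", 295.0),
    ("taxon_B", "contig_08", "gene_2", 180.0),
]

clusters, ambiguous = build_clusters(hits)
print(clusters)
print(ambiguous)
```

Choosing the top-scoring contig is a simplifying assumption of this sketch; the flagged taxa are the ones for which single-partition trees and BLAST searches would be worth checking.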

Methods

Results

Discussion

Conclusion