Abstract
BackgroundConstructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics. One difficulty is that the underlying genes can be incongruent due to evolutionary processes such as gene duplication and loss, deep coalescence, or lateral gene transfer. Gene tree estimation errors may further exacerbate the difficulties of species tree estimation.ResultsWe present a new approach for inferring species trees from incongruent multi-copy gene trees that is based on a generalization of the Robinson-Foulds (RF) distance measure to multi-labeled trees (mul-trees). We prove that it is NP-hard to compute the RF distance between two mul-trees; however, it is easy to calculate this distance between a mul-tree and a singly-labeled species tree. Motivated by this, we formulate the RF problem for mul-trees (MulRF) as follows: Given a collection of multi-copy gene trees, find a singly-labeled species tree that minimizes the total RF distance from the input mul-trees. We develop and implement a fast SPR-based heuristic algorithm for the NP-hard MulRF problem.We compare the performance of the MulRF method (available at http://genome.cs.iastate.edu/CBL/MulRF/) with several gene tree parsimony approaches using gene tree simulations that incorporate gene tree error, gene duplications and losses, and/or lateral transfer. The MulRF method produces more accurate species trees than gene tree parsimony approaches. We also demonstrate that the MulRF method infers in minutes a credible plant species tree from a collection of nearly 2,000 gene trees.ConclusionsOur new phylogenetic inference method, based on a generalized RF distance, makes it possible to quickly estimate species trees from large genomic data sets. Since the MulRF method, unlike gene tree parsimony, is based on a generic tree distance measure, it is appealing for analyses of genomic data sets, in which many processes such as deep coalescence, recombination, gene duplication and losses as well as phylogenetic error may contribute to gene tree discord. In experiments, the MulRF method estimated species trees accurately and quickly, demonstrating MulRF as an efficient alternative approach for phylogenetic inference from large-scale genomic data sets.
Highlights
Constructing species trees from multi-copy gene trees remains a challenging problem in phylogenetics
We examined the accuracy of species tree estimates by Only-dup, Dup-loss, and SPRS when gene tree simulations include events that these methods do not assume to be the source of discordance
We presented the new MulRF method for inferring species tree from incongruent gene trees that is based on a generalized form of the RF distance
Summary
We report the average topological error (ATE) for each model condition. This is the average of the normalized RF distance (dividing the RF distance by number of internal edges in both trees) between each of the 20 model species trees and their estimated species trees. In order to examine how Only-dup, Dup-loss, and SPRS methods perform when gene tree simulations only include events that these methods assume to be the source of discordance, we studied the performance of Only-dup and Dup-loss on evolutionary scenario dl and SPRS on lgt In both evolutionary scenarios we found that the ATE rate of MulRF was lowest for both 50- and 100-taxon data sets (Figure 8). We examined the accuracy of species tree estimates by Only-dup, Dup-loss, and SPRS when gene tree simulations include events that these methods do not assume to be the source of discordance (e.g., dl and both for SPRS, lgt and both for Only-dup and Duploss). The Dup-loss tree, which completed in 7 seconds, had a less accepted topology, placing Amborella sister to the monocots instead of sister to other angiosperms and Figure 8 Experimental result of simulated data. The ATE for the Dup-loss tree is 0.21
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have