Abstract

Background: Species tree estimation is challenging in the presence of incomplete lineage sorting (ILS), which can make gene trees different from the species tree. Because ILS is expected to occur and the standard concatenation approach can return incorrect trees with high support in the presence of ILS, "coalescent-based" summary methods (which first estimate gene trees and then combine gene trees into a species tree) have been developed that have theoretical guarantees of robustness to arbitrarily high amounts of ILS. Some studies have suggested that summary methods should only be used on "c-genes" (i.e., recombination-free loci) that can be extremely short (sometimes fewer than 100 sites). However, gene trees estimated on short alignments can have high estimation error, and summary methods tend to have high error on short c-genes. To address this problem, Chifman and Kubatko introduced SVDquartets, a new coalescent-based method. SVDquartets takes multi-locus unlinked single-site data, infers the quartet trees for all subsets of four species, and then combines the set of quartet trees into a species tree using a quartet amalgamation heuristic. Yet, the accuracy of SVDquartets relative to leading coalescent-based methods has not been assessed.

Results: We compared SVDquartets to two leading coalescent-based methods (ASTRAL-2 and NJst), and to concatenation using maximum likelihood. We used a collection of simulated datasets, varying ILS levels, numbers of taxa, and number of sites per locus. Although SVDquartets was sometimes more accurate than ASTRAL-2 and NJst, most often the best results were obtained using ASTRAL-2, even on the shortest gene sequence alignments we explored (with only 10 sites per locus). Finally, concatenation was the most accurate of all methods under low ILS conditions.

Conclusions: ASTRAL-2 generally had the best accuracy under higher ILS conditions, and concatenation had the best accuracy under the lowest ILS conditions. However, SVDquartets was competitive with the best methods under conditions with low ILS and small numbers of sites per locus. The good performance of ASTRAL-2 under many conditions in comparison to SVDquartets is surprising given the known vulnerability of ASTRAL-2 and similar methods to short gene sequences.
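
The quartet-inference step described in the Background can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the commonly described SVD score for SVDquartets: for each of the three possible splits of four taxa, site patterns are counted in a 16 x 16 "flattening" matrix, and the split whose matrix is closest to rank 10 (smallest trailing singular values) is chosen. The rank threshold, the function names (`flattening`, `svd_score`, `best_split`, `infer_quartets`), and the input format (a dict mapping taxon name to an aligned sequence string) are assumptions for illustration. The quartet amalgamation step is not shown.

```python
from itertools import combinations

import numpy as np

STATES = "ACGT"
IDX = {s: i for i, s in enumerate(STATES)}


def flattening(seqs, pair, rest):
    """Count site patterns in a 16x16 matrix for the split pair | rest.

    Rows index the joint state of the two taxa in `pair`,
    columns index the joint state of the two taxa in `rest`.
    """
    (a, b), (c, d) = pair, rest
    m = np.zeros((16, 16))
    for xa, xb, xc, xd in zip(seqs[a], seqs[b], seqs[c], seqs[d]):
        # Skip sites with gaps or ambiguity codes (a simplifying assumption).
        if all(x in IDX for x in (xa, xb, xc, xd)):
            m[4 * IDX[xa] + IDX[xb], 4 * IDX[xc] + IDX[xd]] += 1
    return m


def svd_score(m, rank=10):
    """Distance of m from the nearest matrix of the given rank
    (Frobenius norm), computed from the trailing singular values."""
    s = np.linalg.svd(m, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))


def best_split(seqs, quartet):
    """Return the split of the four taxa with the lowest SVD score."""
    a, b, c, d = quartet
    splits = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return min(splits, key=lambda sp: svd_score(flattening(seqs, *sp)))


def infer_quartets(seqs):
    """Infer one split for every subset of four taxa."""
    return {q: best_split(seqs, q) for q in combinations(sorted(seqs), 4)}
```

In this sketch the number of quartets grows as the number of taxa choose four, so the loop in `infer_quartets` is only practical for small taxon sets. Sampling a subset of quartets would be one way to scale it.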

Highlights

  • Species tree estimation is challenging in the presence of incomplete lineage sorting (ILS), which can make gene trees different from the species tree

  • We address the following questions: (1) How does SVDquartets+PAUP* compare to ASTRAL-2 and NJst, two of the best performing statistically consistent summary methods? (2) How do the statistically consistent methods we study compare to a concatenated analysis using maximum likelihood? (3) How do all the methods perform on short sequences?

  • Tree estimation error rates decrease as the number of genes or the number of sites per gene increases, and increase as the ILS level increases

Summary

Introduction

Species tree estimation is challenging in the presence of incomplete lineage sorting (ILS), which can make gene trees different from the species tree. Estimating a species tree from multi-locus sequence data is complicated by biological processes such as gene duplication and loss, hybridization, and incomplete lineage sorting, which make true gene trees different from the species tree. Methods for estimating species trees in the presence of ILS have been developed that are provably statistically consistent under the multi-species coalescent model, which means that they will converge in probability to the true species tree as the number of loci and sites per locus increase [4]. In contrast, concatenated analysis using maximum likelihood (CA-ML) is not statistically consistent under the multi-species coalescent and can converge to a tree other than the species tree (i.e., be positively misleading) [17].

