Abstract

Background: Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. However, estimating a species tree from a collection of gene trees can be complicated due to the presence of gene tree incongruence resulting from incomplete lineage sorting (ILS), which is modelled by the multi-species coalescent process. Maximum likelihood and Bayesian MCMC methods can potentially result in accurate trees, but they do not scale well to large datasets.

Results: We present STELAR (Species Tree Estimation by maximizing tripLet AgReement), a new fast and highly accurate statistically consistent coalescent-based method for estimating species trees from a collection of gene trees. We formalized the constrained triplet consensus (CTC) problem and showed that the solution to the CTC problem is a statistically consistent estimate of the species tree under the multi-species coalescent (MSC) model. STELAR is an efficient dynamic programming based solution to the CTC problem which is highly accurate and scalable. We evaluated the accuracy of STELAR in comparison with SuperTriplets, which is an alternate fast and highly accurate triplet-based supertree method, and with MP-EST and ASTRAL – two of the most popular and accurate coalescent-based methods. Experimental results suggest that STELAR matches the accuracy of ASTRAL and improves on MP-EST and SuperTriplets.

Conclusions: Theoretical and empirical results (on both simulated and real biological datasets) suggest that STELAR is a valuable technique for species tree estimation from gene tree distributions.

Highlights

  • Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome

  • On the simulated mammalian dataset, we analyzed the performance of SuperTriplets, ASTRAL-III, MP-EST and STELAR under model conditions with varying amounts of incomplete lineage sorting (ILS), numbers of genes and sequence lengths

  • For the dataset with varying amounts of ILS, SuperTriplets produced trees with Robinson-Foulds (RF) rates of 10% to 18%, whereas the error rates of STELAR, MP-EST and ASTRAL-III ranged from 4% to 6%

Summary

Introduction

Species tree estimation is frequently based on phylogenomic approaches that use multiple genes from throughout the genome. Using data from multiple loci, species tree inference can potentially recover an accurate evolutionary history. However, combining multi-locus data is difficult, especially in the presence of gene tree discordance [1]. Recent modeling and computational advances have produced methods that explicitly take gene tree discordance into account while combining multi-locus data to estimate species trees. Incomplete lineage sorting (ILS), also known as deep coalescence, is one of the most prevalent reasons for gene tree incongruence [1], and it is modelled by the multi-species coalescent (MSC) [7].
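
To make gene tree discordance and triplet agreement concrete, the sketch below scores a candidate species tree by counting how many rooted triplets it shares with each gene tree. This is only a brute-force illustration of the quantity that a triplet-based method seeks to maximize; it is not STELAR's dynamic programming algorithm. The nested-tuple tree representation and the function names (`clades`, `triplets`, `triplet_agreement`) are assumptions made for this example.

```python
from itertools import combinations

# Assumption: a rooted tree is a nested tuple whose leaves are taxon names,
# e.g. ((("A", "B"), "C"), "D"). This is an illustrative representation only.

def clades(tree):
    """Return (leaf set, list of leaf sets of all internal nodes)."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def triplets(tree):
    """Return the set of rooted triplets ab|c, encoded as ({a, b}, c)."""
    leaves, internal = clades(tree)
    result = set()
    for trio in combinations(sorted(leaves), 3):
        for c in trio:
            pair = frozenset(trio) - {c}
            # ab|c holds when some clade contains a and b but excludes c.
            if any(pair <= clade and c not in clade for clade in internal):
                result.add((pair, c))
    return result

def triplet_agreement(candidate, gene_trees):
    """Sum, over gene trees, the number of triplets shared with the candidate."""
    cand = triplets(candidate)
    return sum(len(cand & triplets(g)) for g in gene_trees)

# Three hypothetical gene trees: two agree, one is discordant (as ILS can cause).
gene_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
]
majority = ((("A", "B"), "C"), "D")
minority = ((("A", "C"), "B"), "D")

print(triplet_agreement(majority, gene_trees))  # 11
print(triplet_agreement(minority, gene_trees))  # 10
```

In this toy input the candidate matching the majority topology scores higher (11 versus 10) because only the triplet on A, B and C is resolved differently by the discordant gene tree. Enumerating every triplet costs cubic time in the number of taxa per tree, which is why a practical method would need a more efficient formulation than this sketch.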
