Abstract

Background: Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny. Methods to estimate species trees in the presence of gene tree discord due to incomplete lineage sorting have been developed and proved to be statistically consistent when gene tree discord is due only to incomplete lineage sorting and every gene tree includes the full set of species.

Results: We establish statistical consistency of certain coalescent-based species tree estimation methods under some models of taxon deletion from genes. We also evaluate the impact of missing data on four species tree estimation methods (ASTRAL-II, ASTRID, MP-EST, and SVDquartets) using simulated datasets with varying levels of incomplete lineage sorting, gene tree estimation error, and degrees/patterns of missing data.

Conclusions: All the species tree estimation methods improved in accuracy as the number of genes increased and often produced highly accurate species trees even when the amount of missing data was large. These results together indicate that accurate species tree estimation is possible under a variety of conditions, even when there are substantial amounts of missing data.

Highlights

  • Estimation of species trees from multiple genes is complicated by processes such as incomplete lineage sorting, gene duplication and loss, and horizontal gene transfer, that result in gene trees that differ from each other and from the species phylogeny

  • Our preliminary experiment evaluated the impact of missing data on the inputs given to species tree estimation methods

  • Many species tree estimation methods have been shown to lose accuracy as taxa are deleted from gene trees, leading to the concern that species tree estimation methods might not be statistically consistent under models of taxon deletion
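To make "taxon deletion from gene trees" concrete, the following is a minimal sketch of one possible deletion scheme, in which each taxon is dropped from a gene tree independently with a fixed probability. The nested-tuple tree representation, the function names (`prune`, `leaves`, `delete_taxa_iid`), and the independent-deletion model itself are assumptions made for illustration; the text above does not specify which deletion models the authors analyze.

```python
import random


def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree (leaves are strings)."""
    if tree is None:
        return set()
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def prune(tree, keep):
    """Restrict a tree to the leaves in `keep`.

    Internal nodes left with a single child are collapsed, so the result is
    the induced subtree on the surviving taxa. Returns None if no leaf survives.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    children = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)


def delete_taxa_iid(tree, p_missing, rng):
    """Hypothetical deletion model: drop each taxon independently with probability p_missing."""
    keep = {taxon for taxon in sorted(leaves(tree)) if rng.random() >= p_missing}
    return prune(tree, keep)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    gene_trees = [
        ((("A", "B"), "C"), ("D", "E")),
        (("A", ("B", "C")), ("D", "E")),
    ]
    for gene_tree in gene_trees:
        print(delete_taxa_iid(gene_tree, 0.3, rng))
```

Each pruned tree is what a summary method would receive as input for that gene; a gene for which every taxon is deleted yields `None` and would simply be absent from the input set.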


Summary

Introduction

The estimation of a species phylogeny from multiple loci is confounded by biological processes, such as horizontal gene transfer and incomplete lineage sorting, that cause individual gene tree topologies to differ from each other [1]. While some of these processes require phylogenetic networks for proper modeling of the species phylogeny, other processes, including incomplete lineage sorting (ILS) and gene duplication and loss, are still consistent with a species tree. Examples of methods for species tree estimation that are statistically consistent under the multi-species coalescent (MSC) model include ASTRAL-I [5] and its improved version ASTRAL-II [6], ASTRID [7], *BEAST [8], BEST [9], the population tree in BUCKy [10], GLASS [11], METAL [12], MP-EST [13], NJst [14], SMRT [15], SNAPP [16], STAR [17], STEAC [17], and STEM [18]. Some of these methods (e.g., ASTRAL-I, ASTRID,

Nute et al. BMC Genomics 2018, 19(Suppl 5):286

