Abstract

The multispecies coalescent model provides a natural framework for species tree estimation accounting for gene-tree conflicts. Although a number of species tree methods under the multispecies coalescent have been suggested and evaluated using simulation, their statistical properties remain poorly understood. Here, we use mathematical analysis aided by computer simulation to examine the identifiability, consistency, and efficiency of different species tree methods in the case of three species and three sequences under the molecular clock. We consider four major species-tree methods: concatenation, two-step, independent-sites maximum likelihood, and maximum likelihood. We develop approximations that predict that the probit transform of the species tree estimation error decreases linearly with the square root of the number of loci. Even in this simplest case, major differences exist among the methods. Full-likelihood methods are considerably more efficient than summary methods such as concatenation and two-step. They also provide estimates of important parameters such as species divergence times and ancestral population sizes, whereas these parameters are not identifiable by summary methods. Our results highlight the need to improve the statistical efficiency of summary methods and the computational efficiency of full-likelihood methods of species tree estimation.
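
As a rough illustration of why a probit scale is linear in the square root of the number of loci, consider a simplified calculation that is not taken from the paper: assume the gene tree topology at each of L loci is known without error, and let p_1 and p_2 (hypothetical symbols) be the per-locus probabilities of the matching topology and of one particular mismatching topology, with counts n_1 and n_2. A normal approximation to the difference in counts then gives:

```latex
% Per locus, the contribution to n_1 - n_2 is +1, -1, or 0, with
% mean p_1 - p_2 and variance p_1 + p_2 - (p_1 - p_2)^2.
\begin{align*}
  \Pr(n_1 < n_2)
    &\approx \Phi\!\left(
       -\sqrt{L}\,\frac{p_1 - p_2}{\sqrt{p_1 + p_2 - (p_1 - p_2)^2}}
     \right), \\
  \Phi^{-1}\!\bigl(\Pr(n_1 < n_2)\bigr)
    &\approx -\sqrt{L}\,\frac{p_1 - p_2}{\sqrt{p_1 + p_2 - (p_1 - p_2)^2}} .
\end{align*}
```

Here \Phi is the standard normal distribution function. The slope depends only on the per-locus probabilities, so under these assumptions the probit of this pairwise error falls linearly in \sqrt{L}; the approximations developed in the paper for each method may differ in form and in detail.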

Highlights

  • Two lines of research into the multispecies coalescent (MSC) have provided the foundation for species tree methods

  • The gene tree distribution can be used in the two-step method of species tree estimation, by inferring gene trees for the individual loci and applying maximum likelihood (ML) to counts of gene tree topologies

  • We consider four inference methods: (i) ML, (ii) 2-STEP, (iii) concatenation (CONCAT), and (iv) independent-sites ML (ISML, known as coalescent-aware concatenation or CONCAT) (Xu and Yang 2016)

Summary

Introduction

Two lines of research into the MSC have provided the foundation for species tree methods. In the case considered here, concatenation is equivalent to STEAC (Liu et al. 2009), which uses average coalescent times over loci as data to infer a gene tree, which is then taken as the species tree estimate. ISML (or CONCAT) estimates the species tree by ML under the assumption that all sites, both from the same locus and from different loci, have independent gene trees (Xu and Yang 2016). This was suggested as an improvement to the SVDQUARTETS method of Chifman and Kubatko (2014). 2-STEP, concatenation, and ISML are considered heuristic summary methods: 2-STEP uses the (estimated) gene tree topologies, while concatenation and ISML use the site-pattern counts pooled across loci. We use the theory to characterize how the methods differ in their use of the information in the data.
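
To make the 2-STEP idea concrete, the following is a minimal sketch for three species, assuming the gene tree topologies have already been inferred and are treated as known. It relies on the standard three-species result that the gene tree matching the species tree has probability 1 - (2/3)e^{-T} and each mismatching tree has probability (1/3)e^{-T}, where T is the internal branch length in coalescent units. The function names and the count dictionary are hypothetical and are not from the paper.

```python
import math


def gene_tree_probs(T):
    """Probabilities of the three rooted gene tree topologies for three species.

    T is the internal branch length of the species tree in coalescent units.
    Returns (matching, mismatching_1, mismatching_2).
    """
    mismatch = math.exp(-T) / 3.0
    return (1.0 - 2.0 * mismatch, mismatch, mismatch)


def two_step_estimate(counts):
    """Illustrative 2-STEP estimate from gene tree topology counts.

    counts maps a topology label to the number of loci showing it.
    Under the multinomial model above, the ML species tree is the most
    frequent topology (ties go to the first label encountered).
    """
    total = sum(counts.values())
    best = max(counts, key=counts.get)
    p_hat = counts[best] / total

    # Solve p_hat = 1 - (2/3) * exp(-T) for T, bounded below at zero.
    if p_hat <= 1.0 / 3.0:
        T_hat = 0.0
    elif p_hat >= 1.0:
        T_hat = math.inf
    else:
        T_hat = -math.log(1.5 * (1.0 - p_hat))
    return best, T_hat


if __name__ == "__main__":
    # Hypothetical counts over 100 loci, for illustration only.
    example = {"((A,B),C)": 60, "((A,C),B)": 20, "((B,C),A)": 20}
    print(two_step_estimate(example))
```

The sketch recovers a topology and a branch length in coalescent units only. It ignores gene tree estimation error, which the real two-step method has to contend with.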
