Abstract

Background: Many methods for species tree inference require data from a sufficiently large sample of genomic loci in order to produce accurate estimates. However, few studies have attempted to use analytical theory to quantify “sufficiently large”.

Results: Using the multispecies coalescent model, we report a general analytical upper bound on the number of gene trees n required such that with probability q, each bipartition of a species tree is represented at least once in a set of n random gene trees. This bound employs a formula that is straightforward to compute, depends only on the minimum internal branch length of the species tree and the number of taxa, and applies irrespective of the species tree topology. Using simulations, we investigate numerical properties of the bound as well as its accuracy under the multispecies coalescent.

Conclusions: Our results are helpful for conservatively bounding the number of gene trees required by the ASTRAL inference method, and the approach has potential to be extended to bound other properties of gene tree sets under the model.

Highlights

  • Many methods for species tree inference require data from a sufficiently large sample of genomic loci in order to produce accurate estimates

  • Conditional on the species tree, not all topologies are likely for the gene tree; a random sample of gene trees that have evolved on the species tree contains information about the species tree topology and branch lengths [24]

  • For gene trees and species trees sharing the same set of k taxa, our goal is to study the probability that a random gene tree set G containing n gene trees sampled under the multispecies coalescent model is a bipartition cover of a species tree S
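To make the notion of a bipartition cover concrete, the following is a minimal sketch of how one might check whether a gene tree set G covers every bipartition of a species tree S. It is not the authors' code. The nested-tuple tree representation and the function names `leaves`, `bipartitions`, and `is_bipartition_cover` are assumptions introduced for illustration, and the sketch treats trees as unrooted and counts only nontrivial bipartitions (those induced by internal branches), which may differ in detail from the paper's formal definition.

```python
from typing import FrozenSet, Iterable, Set, Tuple, Union

# Hypothetical representation: a leaf is a taxon label (str),
# an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of taxon labels below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the nontrivial bipartitions of the unrooted version of `tree`.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so the same split always has the same representation.
    """
    all_taxa = leaves(tree)
    ref = min(all_taxa)  # arbitrary but deterministic reference taxon
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_taxa - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 2 <= len(side) <= len(all_taxa) - 2:
            found.add(side)
        return clade

    visit(tree)
    return found


def is_bipartition_cover(species_tree: Tree, gene_trees: Iterable[Tree]) -> bool:
    """True if every species tree bipartition appears in at least one gene tree."""
    seen: Set[FrozenSet[str]] = set()
    for gene_tree in gene_trees:
        seen |= bipartitions(gene_tree)
    return bipartitions(species_tree) <= seen


# Toy example with k = 5 taxa (topologies chosen by hand, not simulated).
S = ((("A", "B"), "C"), ("D", "E"))
g1 = ((("A", "B"), "D"), ("C", "E"))
g2 = ((("A", "C"), "B"), ("D", "E"))

print(is_bipartition_cover(S, [g1]))      # g1 alone misses the split DE | ABC
print(is_bipartition_cover(S, [g1, g2]))  # together they cover both splits of S
```

In this toy example neither gene tree matches the species tree topology, yet the two together form a bipartition cover, which illustrates why the covering property depends on the whole set G and not on any single gene tree.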


Summary

Introduction

Many methods for species tree inference require data from a sufficiently large sample of genomic loci in order to produce accurate estimates. Consensus and summary methods are often chosen in species tree inference studies because they typically have desirable properties, including computational efficiency, scalability to trees with many taxa, and conceptual simplicity. This latter feature makes such methods suitable for mathematical analyses, many of which have emphasized the property of statistical consistency under a standard model for gene tree evolution, the multispecies coalescent [3]. As the number of sampled gene trees increases, the probability that the species tree estimate from a random sample of gene trees produced
