Abstract

Because biological processes can result in different loci having different evolutionary histories, species tree estimation requires multiple loci from across multiple genomes. While many processes can result in discord between gene trees and species trees, incomplete lineage sorting (ILS), modeled by the multi-species coalescent, is considered to be a dominant cause of gene tree heterogeneity. Many of the coalescent-based methods developed to estimate species trees operate by combining estimated gene trees, and so are called "summary methods". Because summary methods are generally fast (and much faster than more complicated coalescent-based methods that co-estimate gene trees and species trees), they have become very popular techniques for estimating species trees from multiple loci. However, recent studies have established that summary methods can have reduced accuracy in the presence of gene tree estimation error, and also that many biological datasets have substantial gene tree estimation error, so that summary methods may not be highly accurate in biologically realistic conditions. Mirarab et al. (Science 2014) presented the "statistical binning" technique to improve gene tree estimation in multi-locus analyses, and showed that it improved the accuracy of MP-EST, one of the most popular coalescent-based summary methods. Statistical binning, which uses a simple heuristic to evaluate "combinability" and then uses the larger sets of genes to re-calculate gene trees, has good empirical performance, but using statistical binning within a phylogenomic pipeline does not have the desirable property of being statistically consistent. We show that weighting the re-calculated gene trees by the bin sizes makes statistical binning statistically consistent under the multi-species coalescent, and maintains the good empirical performance. Thus, "weighted statistical binning" enables highly accurate genome-scale species tree estimation, and is also statistically consistent under the multi-species coalescent model. New data used in this study are available at DOI: http://dx.doi.org/10.6084/m9.figshare.1411146, and the software is available at https://github.com/smirarab/binning.
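
The weighting step described in the abstract can be illustrated with a minimal sketch. It assumes that each bin already has one re-calculated tree (given here as a Newick string) and that the downstream summary method accepts a plain list of trees, so that "weighting by bin size" can be expressed by repeating each bin's tree once per gene in the bin. The function name and the input layout are hypothetical and are not taken from the released software.

```python
def weight_trees_by_bin_size(bins):
    """Return one tree per gene, repeating each bin's tree by its bin size.

    `bins` is a list of (tree_newick, gene_names) pairs. This layout is an
    assumption made for illustration only.
    """
    weighted_trees = []
    for tree_newick, gene_names in bins:
        # A bin holding k genes contributes k copies of its re-calculated tree.
        weighted_trees.extend([tree_newick] * len(gene_names))
    return weighted_trees


# Hypothetical example: one bin of three genes and one singleton bin.
example_bins = [
    ("((A,B),C);", ["gene1", "gene2", "gene3"]),
    ("((A,C),B);", ["gene4"]),
]
weighted = weight_trees_by_bin_size(example_bins)
```

Without the weighting, the summary method would see one tree per bin (two trees in this example); with it, the input contains as many trees as there are genes.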

Highlights

  • The estimation of phylogenetic trees, whether of individual loci or at the genome-level, is a basic step in many biological analyses [1]

  • We evaluated the impact of statistical binning on gene tree estimation error for the 1X model condition, with sequence lengths varying from 250 bp to 1500 bp

  • We express these results using a cumulative distribution over all possible triplets and all replicates; if a curve for one method lies above the curve for another method, the first method strictly improves on the second method with respect to estimating the gene tree distribution
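
The curve comparison in the last highlight can be sketched as follows. The sketch assumes each method yields one hypothetical error value per triplet and replicate (smaller is better); the highlight does not name the quantity, so this is an illustration only. It builds an empirical cumulative distribution for each method and checks whether one curve lies on or above the other at every observed value.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function F where F(t) is the fraction of values <= t."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n


def curve_lies_above(errors_a, errors_b):
    """True if method A's curve is never below B's and is above it somewhere."""
    f_a, f_b = ecdf(errors_a), ecdf(errors_b)
    # Both curves are step functions, so checking the observed values suffices.
    grid = sorted(set(errors_a) | set(errors_b))
    never_below = all(f_a(t) >= f_b(t) for t in grid)
    somewhere_above = any(f_a(t) > f_b(t) for t in grid)
    return never_below and somewhere_above


# Hypothetical per-triplet errors for two methods.
errors_method_a = [0.0, 0.1, 0.1, 0.2]
errors_method_b = [0.1, 0.2, 0.3, 0.4]
a_above_b = curve_lies_above(errors_method_a, errors_method_b)
```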


Summary

Introduction

The estimation of phylogenetic trees, whether of individual loci (so-called “gene trees”) or at the genome level (species trees), is a basic step in many biological analyses [1]. When incomplete lineage sorting (ILS) occurs, standard methods for estimating species trees, such as concatenation (which combines sequence alignments from different loci into a single “supermatrix”, and computes a tree on the supermatrix) and consensus methods, can be statistically inconsistent [6, 7], and produce highly supported but incorrect trees [8]. Because these standard methods for estimating species trees from multiple loci can be positively misleading in the presence of gene tree heterogeneity due to ILS, statistical methods (e.g., [9, 10, 11, 12, 13]) have been developed to estimate the species tree assuming all gene tree heterogeneity is due to ILS and, in particular, not to poor phylogenetic signal.
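
To make the concatenation step concrete, here is a minimal sketch of building a supermatrix from per-locus alignments. It assumes each alignment is a dictionary mapping taxon name to an aligned sequence, and that a taxon missing from a locus is padded with gap characters; both choices, and the function name, are assumptions made for illustration. Tree estimation on the resulting supermatrix is not shown.

```python
def concatenate_alignments(loci, missing="-"):
    """Join per-locus alignments (taxon -> sequence) into one supermatrix."""
    taxa = sorted({taxon for alignment in loci for taxon in alignment})
    parts = {taxon: [] for taxon in taxa}
    for alignment in loci:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        locus_length = lengths.pop()
        for taxon in taxa:
            # Pad taxa that were not sampled for this locus with gap characters.
            parts[taxon].append(alignment.get(taxon, missing * locus_length))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Hypothetical example: two loci, with taxon B missing from the second.
example_loci = [
    {"A": "ACGT", "B": "ACGA", "C": "ACTT"},
    {"A": "GG", "C": "GA"},
]
supermatrix = concatenate_alignments(example_loci)
```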

