Abstract

Fully Bayesian multispecies coalescent (MSC) methods like *BEAST estimate species trees from multiple sequence alignments. Today thousands of genes can be sequenced for a given study, but using that many genes with *BEAST is intractably slow. An alternative is to use heuristic methods, which compromise accuracy or completeness in return for speed. A common heuristic is concatenation, which assumes that the evolutionary history of each gene tree is identical to the species tree. Concatenation is an inconsistent estimator of species tree topology and a worse estimator of divergence times, and it induces spurious substitution rate variation when incomplete lineage sorting is present. Another class of heuristics, directly motivated by the MSC, avoids many of the pitfalls of concatenation but cannot be used to estimate divergence times. To enable fuller use of available data and more accurate inference of species tree topologies, divergence times, and substitution rates, we have developed a new version of *BEAST called StarBEAST2. To improve convergence rates we add analytical integration of population sizes, novel MCMC operators, and other optimizations. Computational performance improved by 13.5× and 13.8× respectively when analyzing two empirical data sets, and by an average of 33.1× across 30 simulated data sets. To enable accurate estimates of per-species substitution rates, we introduce species tree relaxed clocks, and show that StarBEAST2 is a more powerful and robust estimator of rate variation than concatenation. StarBEAST2 is available through the BEAUti package manager in BEAST 2.4 and above.

Highlights

  • The throughput of sequencing technologies has improved remarkably over the past two decades, culminating in next-generation sequencing (NGS), and it is feasible to sequence whole or partial genomes or transcriptomes for phylogenetic studies (Lemmon and Lemmon 2013)

  • With the aim of improving the computational performance of fully Bayesian multispecies coalescent (MSC) inference of species trees, we have developed an upgrade to *BEAST, called StarBEAST2, which is available as a package for BEAST 2 (Bouckaert et al. 2014)

  • We used this method to test the correctness of the novel features in StarBEAST2: analytical population size integration, coordinated operators, and species tree relaxed clocks

Introduction

The throughput of sequencing technologies has improved remarkably over the past two decades, culminating in next-generation sequencing (NGS), and it is feasible to sequence whole or partial genomes or transcriptomes for phylogenetic studies (Lemmon and Lemmon 2013). *BEAST, a fully Bayesian method of species tree inference that implements a realistic and robust evolutionary model in the multispecies coalescent (MSC; Degnan and Rosenberg 2009, Heled and Drummond 2010), becomes exponentially slower as the number of loci in an analysis is increased. This scaling behavior makes *BEAST intractably slow beyond a certain number of loci (the exact number depends on other parameters of the data set; see Ogilvie et al. 2016). Given the current challenges of using large phylogenomic data sets with *BEAST, three broad alternatives have been available to researchers: concatenate sequences from multiple loci, use heuristic methods that are statistically consistent with the MSC, or choose a tractable subset of loci to use with a fully Bayesian method like *BEAST, BEST (Liu 2008), or BPP (Yang 2015).
