Abstract

Recent advances in statistical machine learning techniques have led to the creation of probabilistic programming frameworks. These frameworks enable probabilistic models to be rapidly prototyped and fit to data using scalable approximation methods such as variational inference. In this work, we explore the use of the Stan language for probabilistic programming in application to phylogenetic models. We show that many commonly used phylogenetic models including the general time reversible substitution model, rate heterogeneity among sites, and a range of coalescent models can be implemented using a probabilistic programming language. The posterior probability distributions obtained via the black box variational inference engine in Stan were compared to those obtained with reference implementations of Markov chain Monte Carlo (MCMC) for phylogenetic inference. We find that black box variational inference in Stan is less accurate than MCMC methods for phylogenetic models, but requires far less compute time. Finally, we evaluate a custom implementation of mean-field variational inference on the Jukes–Cantor substitution model and show that a specialized implementation of variational inference can be two orders of magnitude faster and more accurate than a general purpose probabilistic implementation.

Highlights

  • Markov chain Monte Carlo (MCMC) algorithms have become the workhorse of Bayesian phylogenetic inference since they were introduced in the late 1990s (Mau & Newton, 1997; Larget & Simon, 1999)

  • We analyzed a set of heterochronous influenza A virus sequences under the strict clock model on a fixed topology with BEAST2 and phylostan

  • We have developed a tool based on the Stan package for Bayesian phylogenetic inference, which to our knowledge is the first application of variational Bayes (VB) to time trees with coalescent models

Introduction

Markov chain Monte Carlo (MCMC) algorithms have become the workhorse of Bayesian phylogenetic inference since they were introduced in the late 1990s (Mau & Newton, 1997; Larget & Simon, 1999). Recent advances in computing hardware and corresponding software implementations have allowed this class of inference methods to handle increasingly large datasets (Flouri et al., 2015; Ayres et al., 2019). However, the quantity of sequence data generated every year has been growing exponentially. Combined with practitioners' desire to conduct inference on increasingly rich statistical models, this growth makes MCMC algorithms difficult to apply in practice because they are too slow.

