Abstract

Background

Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. In the Bayesian framework, model selection is typically performed through the evaluation of a Bayes factor, the ratio of two marginal likelihoods (one for each model). Recently introduced techniques to estimate (log) marginal likelihoods, such as path sampling and stepping-stone sampling, offer increased accuracy over the traditional harmonic mean estimator at an increased computational cost. Most often, each model’s marginal likelihood will be estimated individually, which leads the resulting Bayes factor to suffer from errors associated with each of these independent estimation processes.

Results

We here assess the original ‘model-switch’ path sampling approach for direct Bayes factor estimation in phylogenetics, as well as an extension that uses more samples, to construct a direct path between two competing models, thereby eliminating the need to calculate each model’s marginal likelihood independently. Further, we provide a competing Bayes factor estimator using an adaptation of the recently introduced stepping-stone sampling algorithm and set out to determine appropriate settings for accurately calculating such Bayes factors, with context-dependent evolutionary models as an example. While we show that modest efforts are required to roughly identify the increase in model fit, only drastically increased computation times ensure the accuracy needed to detect more subtle details of the evolutionary process.

Conclusions

We show that our adaptation of stepping-stone sampling for direct Bayes factor calculation outperforms the original path sampling approach as well as an extension that exploits more samples. Our proposed approach for Bayes factor estimation also has preferable statistical properties over the use of individual marginal likelihood estimates for both models under comparison.
Assuming a sigmoid function to determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational effort to approximate the true value of the (log) Bayes factor than the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation.

Highlights

  • Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution

  • Laurasiatheria data set: As a means of comparison for our proposed approach, we first estimate the marginal likelihoods of the presented context-dependent model and of a site-independent reference model, the general time-reversible (GTR) evolutionary model, which contains 5 free evolutionary parameters and 3 free base frequencies

  • As mentioned in the Methods section, the harmonic mean estimator (HME) tends to be biased towards higher-dimensional models, meaning that the log Bayes factor shown in Table 1 is possibly an overestimation of the true log Bayes factor
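
For readers unfamiliar with the estimator named in the last highlight, the following is a minimal Python sketch of the harmonic mean estimator computed in log space. It is an illustration under stated assumptions, not the authors' implementation: the function name `log_harmonic_mean_estimate` is hypothetical, and the input is assumed to be a list of log-likelihood values evaluated at samples drawn from the posterior.

```python
import math


def log_harmonic_mean_estimate(log_likelihoods):
    """Return the log of the harmonic mean estimate of a marginal likelihood.

    Assumption: `log_likelihoods` holds log p(Y | theta_i, M) for samples
    theta_i drawn from the posterior under model M.
    """
    n = len(log_likelihoods)
    if n == 0:
        raise ValueError("at least one log-likelihood value is required")

    # HME = n / sum_i (1 / L_i), so log HME = log n - logsumexp(-log L_i).
    negated = [-value for value in log_likelihoods]

    # Log-sum-exp with the maximum factored out to avoid overflow.
    largest = max(negated)
    log_sum = largest + math.log(sum(math.exp(v - largest) for v in negated))

    return math.log(n) - log_sum
```

Because the sum of inverse likelihoods is dominated by the samples with the lowest likelihood, a handful of poorly fitting samples can determine the estimate, which is one commonly cited reason to treat its output with caution.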


Summary

Introduction

Accurate model comparison requires extensive computation times, especially for parameter-rich models of sequence evolution. More accurate mathematical models of molecular sequence evolution continue to be developed, and for good reason: their additional complexity can lead to the identification of important evolutionary processes that would be missed with simpler models. These models come at a drastically elevated computational cost due to their larger number of parameters and the need for data augmentation to make the likelihood calculations feasible [2]. In the Bayesian framework, model selection is typically performed through the (log) Bayes factor between two models, which is a ratio of two marginal likelihoods (i.e. two normalizing constants of the form p(Y | M), with Y the observed data and M an evolutionary model under evaluation) obtained for the two models, M0 and M1, under comparison [6]: B10 = p(Y | M1) / p(Y | M0).
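
To make the definition concrete, here is a small Python sketch of how a log Bayes factor follows from two separately estimated log marginal likelihoods, and how the errors of the two estimates combine. The function names and all numbers are hypothetical and chosen for illustration only; they are not results from the paper. The error formula assumes the two estimation errors are independent.

```python
import math


def log_bayes_factor(log_ml_m1, log_ml_m0):
    """log B10 = log p(Y | M1) - log p(Y | M0)."""
    return log_ml_m1 - log_ml_m0


def combined_standard_error(se_m1, se_m0):
    """Standard error of the difference, assuming independent errors."""
    return math.hypot(se_m1, se_m0)


# Hypothetical log marginal likelihoods and standard errors (illustration only).
log_ml_m1, se_m1 = -1000.0, 0.3
log_ml_m0, se_m0 = -1010.0, 0.4

log_b10 = log_bayes_factor(log_ml_m1, log_ml_m0)
se_b10 = combined_standard_error(se_m1, se_m0)
print(f"log B10 = {log_b10:.1f} +/- {se_b10:.1f}")
```

The combined error is larger than either individual error, which illustrates why estimating each marginal likelihood separately lets both sources of error enter the resulting Bayes factor.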
