Abstract
Divergence date estimates are central to understanding evolutionary processes and depend, in the case of molecular phylogenies, on tests of molecular clocks. Here we propose two non-parametric tests of strict and relaxed molecular clocks, built upon a framework that uses the empirical cumulative distribution (ECD) of branch lengths obtained from an ensemble of Bayesian trees and the well-known non-parametric (one-sample and two-sample) Kolmogorov-Smirnov (KS) goodness-of-fit tests. In the strict clock case, the method uses the one-sample KS test to test directly whether the phylogeny is clock-like, in other words, whether it follows a Poisson law. The ECD is computed from the discretized branch lengths, and the parameter λ of the expected Poisson distribution is calculated as the average branch length over the ensemble of trees. To compensate for auto-correlation in the ensemble of trees and for pseudo-replication, we take advantage of thinning and effective sample size, two features provided by Bayesian inference MCMC samplers. Finally, we observe that tree topologies with very long or very short branches lead to Poisson mixtures; in this case we propose the use of the two-sample KS test with samples from two continuous branch length distributions, one obtained from an ensemble of clock-constrained trees and the other from an ensemble of unconstrained trees. In this second form the test can also be applied to relaxed clock models. Using a statistically equivalent ensemble of phylogenies to obtain the branch length ECD, instead of a single consensus tree, considerably reduces the effects of small sample size and provides a gain in power.
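The two KS statistics described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the discretization rule (branch length in substitutions per site multiplied by a hypothetical alignment length `n_sites` and rounded) and the example numbers are assumptions made only for illustration. The first function compares the ECD of discretized branch lengths with a Poisson law whose λ is the sample mean; the second compares two continuous branch length samples.

```python
import math
from collections import Counter


def discretize(branch_lengths, n_sites):
    """Assumed discretization: expected substitution counts, rounded to integers."""
    return [round(b * n_sites) for b in branch_lengths]


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return min(total, 1.0)


def ks_one_sample_poisson(counts):
    """KS distance between the ECD of integer counts and Poisson(mean of counts)."""
    n = len(counts)
    lam = sum(counts) / n
    freq = Counter(counts)
    d, cum = 0.0, 0
    # Both step functions jump only at integers, so checking integers suffices.
    for k in range(0, max(counts) + 1):
        cum += freq.get(k, 0)
        d = max(d, abs(cum / n - poisson_cdf(k, lam)))
    return d, lam


def ks_two_sample(a, b):
    """KS distance between the ECDs of two samples (e.g. clock vs. unconstrained)."""
    a, b = sorted(a), sorted(b)
    na, nb = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < na and j < nb:
        x = min(a[i], b[j])
        while i < na and a[i] <= x:
            i += 1
        while j < nb and b[j] <= x:
            j += 1
        d = max(d, abs(i / na - j / nb))
    return d


# Hypothetical pooled branch lengths from a small ensemble of trees.
pooled = [0.012, 0.020, 0.031, 0.018, 0.025, 0.009, 0.022, 0.027]
d_stat, lam_hat = ks_one_sample_poisson(discretize(pooled, n_sites=1000))
print("one-sample D:", d_stat, "lambda:", lam_hat)
```

Because λ is estimated from the same data and the Poisson law is discrete, the classical KS critical values are only approximate in this setting; the sketch returns the distance alone and leaves the choice of reference distribution to the reader.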
Highlights
The molecular clock hypothesis postulates that for a given informational macro-molecule (DNA or protein sequence) the evolutionary rate is approximately constant over time in all evolutionary lines of descent.
Definition of PKS sample sizes: The unadjusted sample size is defined as N = τB, where B is the number of branches of the trees and τ is the least number of trees that satisfies the following two conditions: (i) τ ≥ number of taxa; (ii) D_PKS is minimal with respect to τ, that is, the fit of the empirical cumulative distribution (ECD) is the best possible, given that condition (i) is satisfied.
The adjusted sample size is defined as N_ADJ = kN = kτB, where the auto-correlation coefficient is k = T_ESS/T, with T the total number of trees generated by the Markov Chain Monte Carlo (MCMC) sampler after the burn-in, and T_ESS the effective sample size associated with the tree lengths (TL) computed by the MCMC sampler.
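The sample size definitions above amount to simple arithmetic, sketched below. The function names and the example numbers are hypothetical. Using the adjusted size in the standard large-sample KS critical value, c(α)/√n with c(α) = √(−ln(α/2)/2), is an assumption about how the adjustment would enter the test, not something stated in the text.

```python
import math


def adjusted_sample_size(tau, n_branches, t_ess, t_total):
    """Return (N, k, N_adj) with N = tau * B, k = T_ESS / T and N_adj = k * N."""
    n = tau * n_branches          # unadjusted sample size N = tau * B
    k = t_ess / t_total           # auto-correlation coefficient k = T_ESS / T
    return n, k, k * n


def ks_critical_value(n_eff, alpha=0.05):
    """Asymptotic one-sample KS critical value for an effective sample size."""
    c_alpha = math.sqrt(-math.log(alpha / 2.0) / 2.0)   # about 1.358 at alpha = 0.05
    return c_alpha / math.sqrt(n_eff)


# Hypothetical run: 10 trees of 18 branches each, 1000 post-burn-in trees,
# and an effective sample size of 250 for the tree lengths.
n, k, n_adj = adjusted_sample_size(tau=10, n_branches=18, t_ess=250.0, t_total=1000)
print("N =", n, "k =", k, "N_adj =", n_adj)
print("critical D at 5%:", ks_critical_value(n_adj))
```

Since k is at most 1 when the effective sample size does not exceed the number of sampled trees, the adjustment shrinks N and therefore raises the critical distance, which makes the test more conservative for auto-correlated samples.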
Summary
The molecular clock hypothesis postulates that for a given informational macro-molecule (DNA or protein sequence) the evolutionary rate is approximately constant over time in all evolutionary lines of descent. This implies that if genetic divergence accumulates in a stochastic clock-like manner, that is, with an approximately constant number of mutations accumulated per time interval, time scales could be determined for evolutionary events, with calibration using fossil evidence. In the strict neutral model the dynamics depends on the neutral mutation rate alone; one may expect most sites in a functional protein to be constrained during most of the evolutionary time. This observation motivated the introduction of the doubly stochastic Poisson process, or Cox process, as a model for the substitution process, implying that positive selection, if it occurs, does so in episodic fashion and should affect only a few sites [7,8]. These ideas motivated the introduction of relaxed molecular clock models and advanced their use for inferring dates of divergence events.