Abstract

We introduce a systematic method of approximating finite-time transition probabilities for continuous-time insertion-deletion models on sequences. The method uses automata theory to describe the action of an infinitesimal evolutionary generator on a probability distribution over alignments, where both the generator and the alignment distribution can be represented by pair hidden Markov models (HMMs). In general, combining HMMs in this way induces a multiplication of their state spaces; to control this, we introduce a coarse-graining operation to keep the state space at a constant size. This leads naturally to ordinary differential equations for the evolution of the transition probabilities of the approximating pair HMM. The TKF91 model emerges as an exact solution to these equations for the special case of single-residue indels. For the more general case of multiple-residue indels, the equations can be solved by numerical integration. Using simulated data, we show that the resulting distribution over alignments, when compared to previous approximations, is a better fit over a broader range of parameters. We also propose a related approach to develop differential equations for sufficient statistics to estimate the underlying instantaneous indel rates by expectation maximization. Our code and data are available at https://github.com/ihh/trajectory-likelihood.

Highlights

  • In molecular evolution, the equations of motion describe continuous-time Markov processes on discrete nucleotide or amino acid sequences

  • We will mostly consider alignments of two sequences that we will refer to as the “ancestor” and the “descendant,” where the likelihood function takes the form P(descendant, alignment | ancestor, Q, t), where Q represents model parameters and t is a time parameter. Common uses of this likelihood function include performing sequence alignment, finding maximum-likelihood estimates of the time parameter t, and comparing different models or parameterizations Q

  • While we focus here on the multiresidue indel process, the generality of the infinitesimal automata suggests that other local evolutionary models, such as those allowing neighbor-dependent substitution and indel rates, might be productively analyzed using this approach

Introduction

In molecular evolution, the equations of motion describe continuous-time Markov processes on discrete nucleotide or amino acid sequences. We will mostly consider alignments of two sequences that we will refer to as the “ancestor” and the “descendant,” where the likelihood function takes the form P(descendant, alignment | ancestor, Q, t), where Q represents model parameters (e.g., mutation rates) and t is a time parameter. Common uses of this likelihood function include performing sequence alignment (for downstream inference based on homology, such as protein structure prediction), finding maximum-likelihood estimates of the time parameter t (for example, as part of phylogenetic inference of ancestral relationships), and comparing different models or parameterizations Q (for example, to measure the rate of evolution in sequences, or to annotate conserved regions). Most of our discussion will be at this level of summarization.
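
As a toy illustration of a likelihood of this general form, the sketch below evaluates P(descendant | ancestor, t) for an ungapped pairwise alignment under the Jukes-Cantor substitution model and finds the maximum-likelihood time. This is an assumption made for brevity: it ignores indels entirely and is not the model developed in this paper. The function names are hypothetical, and time is scaled so that t is the expected number of substitutions per site.

```python
import math


def jc69_log_likelihood(ancestor: str, descendant: str, t: float) -> float:
    """Log P(descendant | ancestor, t) under Jukes-Cantor, ungapped alignment.

    Assumption: substitution-only toy model, t > 0 measured in expected
    substitutions per site. Not the indel model discussed in the text.
    """
    if len(ancestor) != len(descendant):
        raise ValueError("sequences must be aligned without gaps")
    if t <= 0:
        raise ValueError("t must be positive")
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability a site is unchanged
    p_diff = 0.25 - 0.25 * decay   # probability of one specific other residue
    return sum(
        math.log(p_same if a == d else p_diff)
        for a, d in zip(ancestor, descendant)
    )


def jc69_ml_time(ancestor: str, descendant: str) -> float:
    """Closed-form maximum-likelihood estimate of t under Jukes-Cantor."""
    if len(ancestor) != len(descendant) or not ancestor:
        raise ValueError("sequences must be non-empty and the same length")
    mismatches = sum(a != d for a, d in zip(ancestor, descendant))
    p = mismatches / len(ancestor)
    if p >= 0.75:
        return math.inf  # saturated: the likelihood keeps rising with t
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    anc, desc = "ACGTACGTAC", "ACGTACGTAA"
    t_hat = jc69_ml_time(anc, desc)
    print(t_hat, jc69_log_likelihood(anc, desc, t_hat))
```

The same pattern (evaluate the likelihood as a function of t, then maximize) carries over to models with indels, except that the per-site probabilities are replaced by the transition and emission probabilities of a pair HMM.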
