Abstract
BackgroundInsertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and thus it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. Recently, indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment. However, it is not a priori clear how these models are related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time-axis. Moreover, currently none of these models can fully accommodate biologically realistic features, such as overlapping indels, power-law indel-length distributions, and indel rate variation across regions.ResultsHere, we theoretically dissect the ab initio calculation of the probability of a given sequence alignment under a genuine stochastic evolutionary model, more specifically, a general continuous-time Markov model of the evolution of an entire sequence via insertions and deletions. Our model is a simple extension of the general “substitution/insertion/deletion (SID) model”. Using the operator representation of indels and the technique of time-dependent perturbation theory, we express the ab initio probability as a summation over all alignment-consistent indel histories. Exploiting the equivalence relations between different indel histories, we find a “sufficient and nearly necessary” set of conditions under which the probability can be factorized into the product of an overall factor and the contributions from regions separated by gapless columns of the alignment, thus providing a sort of generalized HMM. The conditions distinguish evolutionary models with factorable alignment probabilities from those without ones. The former category includes the “long indel” model (a space-homogeneous SID model) and the model used by Dawg, a genuine sequence evolution simulator.ConclusionsWith intuitive clarity and mathematical preciseness, our theoretical formulation will help further advance the ab initio calculation of alignment probabilities under biologically realistic models of sequence evolution via indels.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-016-1105-7) contains supplementary material, which is available to authorized users.
Highlights
Insertions and deletions account for more nucleotide differences between two related DNA sequences than substitutions do, and it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes
Since the groundbreaking works by Bishop and Thompson [8] and by Thorne, Kishino and Felsenstein [9], many studies have been made to calculate the probabilities of pairwise alignments (PWAs) and multiple sequence alignments (MSAs) under probabilistic models aiming to incorporate the effects of indels
To the best of our knowledge, this is the first study to theoretically dissect the ab initio calculation of alignment probabilities under a genuine stochastic evolutionary model, which describes the evolution of an entire sequence via insertions and deletions along the time axis
Summary
Insertions and deletions (indels) account for more nucleotide differences between two related DNA sequences than substitutions do, and it is imperative to develop a stochastic evolutionary model that enables us to reliably calculate the probability of the sequence evolution through indel processes. Indel probabilistic models are mostly based on either hidden Markov models (HMMs) or transducer theories, both of which give the indel component of the probability of a given sequence alignment as a product of either probabilities of column-to-column transitions or block-wise contributions along the alignment It is not a priori clear how these models are related with any genuine stochastic evolutionary model, which describes the stochastic evolution of an entire sequence along the time-axis. The methods have greatly improved in terms of the computational efficiency and the scope of application (reviewed, e.g., in [10,11,12]) Most of these studies are based on hidden Markov models (HMMs) (e.g., [13]) or transducer theories (e.g., [14]). The studies on these methods are steadily advancing (e.g., [15, 16]), and it seems that their mathematical and algorithmic bases are about to be established
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.