Abstract

Background: Many models of protein sequence evolution, in particular those based on Point Accepted Mutation (PAM) matrices, assume that its dynamics is Markovian. Nevertheless, it has been observed that evolution seems to proceed differently at different time scales, which calls this assumption into question. In 2011 Kosiol and Goldman proved that, if evolution is Markovian at the codon level, it cannot be Markovian at the amino acid level. However, it remains unclear to what extent the Markov assumption holds at the codon level.

Results: Here we show that the among-site variability of substitution rates also makes the process of full protein sequence evolution effectively non-Markovian, even at the codon level. This may be the theoretical explanation behind the well-known systematic underestimation of evolutionary distances observed when rate variability is omitted. If the substitution rate variability is neglected, the average amino acid and codon replacement probabilities are affected by systematic errors, and the largest mismatches occur for substitutions involving more than one nucleotide at a time. On the other hand, the instantaneous substitution matrices estimated from alignments under the Markov assumption tend to overestimate double and triple substitutions, even when learned from alignments at high sequence identity.

Conclusions: These results discourage the use of simple Markov models to describe full protein sequence evolution and encourage the use, whenever possible, of models that account for rate variability by construction (such as hidden Markov models or mixture models) or of substitution models of the type of Le and Gascuel (2008) that account for it explicitly.

Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1135-1) contains supplementary material, which is available to authorized users.
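The claim that among-site rate variability breaks the Markov property can be illustrated with a toy calculation. The sketch below is not the article's codon or amino acid model: as a stand-in it assumes a four-state Jukes-Cantor process (chosen only because its transition matrix has a closed form) and site rates drawn from a gamma distribution with mean 1 and shape `alpha`. All function names are hypothetical. A time-homogeneous Markov process must satisfy P(2t) = P(t) P(t); the site-averaged matrix does not.

```python
import math

def jc_eigenvalue(t):
    """Non-trivial eigenvalue of the Jukes-Cantor matrix P(t) for a single rate of 1."""
    return math.exp(-4.0 * t / 3.0)

def gamma_avg_eigenvalue(t, alpha):
    """Same eigenvalue averaged over site rates r ~ Gamma(shape=alpha, mean=1)."""
    return (1.0 + 4.0 * t / (3.0 * alpha)) ** (-alpha)

def jc_matrix(eig):
    """Build the 4x4 Jukes-Cantor transition matrix from its non-trivial eigenvalue."""
    same = 0.25 + 0.75 * eig   # probability of observing the same state
    diff = 0.25 - 0.25 * eig   # probability of each specific different state
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def ck_gap(eigenvalue_at, t):
    """Largest entry of |P(2t) - P(t) P(t)|; zero for a homogeneous Markov process."""
    p_t = jc_matrix(eigenvalue_at(t))
    p_2t = jc_matrix(eigenvalue_at(2.0 * t))
    p_tt = matmul(p_t, p_t)
    return max(abs(p_2t[i][j] - p_tt[i][j]) for i in range(4) for j in range(4))

if __name__ == "__main__":
    t, alpha = 0.5, 0.5  # illustrative values, not taken from the article
    print("single rate:", ck_gap(jc_eigenvalue, t))
    print("gamma rates:", ck_gap(lambda s: gamma_avg_eigenvalue(s, alpha), t))
```

With a single rate the gap vanishes up to rounding, whereas the gamma-averaged matrix leaves a clearly non-zero gap. Each site is Markovian on its own, but the average over sites is not.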

Highlights

  • Many models of protein sequence evolution, in particular those based on Point Accepted Mutation (PAM) matrices, assume that its dynamics is Markovian

  • Since the publication of the work by Dayhoff and Eck [1], which first introduced the concept of PAM matrices, protein sequence evolution has typically been modeled as a time-homogeneous Markov process in which each protein site is assumed to follow the same dynamical laws and to evolve independently of the other sites and of its own past history

  • Substitution matrices for both amino acids and codons tend to show high instantaneous rates for double and triple substitutions, i.e. substitutions between codons differing by more than one nucleotide or between amino acids whose codons all differ by more than one nucleotide. This phenomenon, according to biochemical wisdom, does not seem realistic and may hint at a further violation of the Markov assumption that is not taken into account even when evolution is described at the codon level. Another important result in the description of protein sequence evolution was obtained in 1993 by Yang [13, 14], who proved that estimates of evolutionary distances and evolutionary trees improve if the variability of substitution rates over sites is accounted for


Summary

Introduction

Many models of protein sequence evolution, in particular those based on Point Accepted Mutation (PAM) matrices, assume that its dynamics is Markovian. Substitution matrices for both amino acids and codons tend to show high instantaneous rates for double and triple substitutions, i.e. substitutions between codons differing by more than one nucleotide or between amino acids whose codons all differ by more than one nucleotide. This phenomenon, according to biochemical wisdom, does not seem realistic and may hint at a further violation of the Markov assumption that is not taken into account even when evolution is described at the codon level. Another important result in the description of protein sequence evolution was obtained in 1993 by Yang [13, 14], who proved that estimates of evolutionary distances and evolutionary trees improve if the variability of substitution rates over sites is accounted for. This rate variability, which is typically modeled by a gamma distribution [13, 14], is due to many effects, including different structural and functional constraints [15] and coevolution, which induces a coupling between substitutions at close-by sites [16, 17].
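The effect of gamma-distributed rates on distance estimates can be made concrete with a small worked example. This is only a sketch under stated assumptions: it uses the four-state Jukes-Cantor nucleotide model as a simple stand-in for the protein and codon models discussed here, and the function names and parameter values are hypothetical. It computes the expected fraction of differing sites when rates follow a gamma distribution with mean 1 and shape `alpha`, then converts that fraction back into a distance with and without the gamma correction.

```python
import math

def p_distance_gamma(d, alpha):
    """Expected fraction of differing sites at true distance d (substitutions
    per site) when site rates follow Gamma(shape=alpha, mean=1)."""
    return 0.75 * (1.0 - (1.0 + 4.0 * d / (3.0 * alpha)) ** (-alpha))

def jc_distance(p):
    """Jukes-Cantor distance estimate that ignores rate variability."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def jc_gamma_distance(p, alpha):
    """Jukes-Cantor distance estimate that accounts for gamma-distributed rates."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)

if __name__ == "__main__":
    alpha = 0.5  # illustrative shape parameter, not taken from the article
    for true_d in (0.1, 0.5, 1.0, 2.0):
        p = p_distance_gamma(true_d, alpha)
        print(f"true d = {true_d:.2f}  "
              f"no-gamma estimate = {jc_distance(p):.3f}  "
              f"gamma estimate = {jc_gamma_distance(p, alpha):.3f}")
```

In this toy setting the estimate that ignores rate variability falls below the true distance, and the shortfall grows with the distance, while the gamma-corrected formula recovers the true value exactly because it inverts the same model that generated the data.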

