Abstract
Few models of sequence evolution incorporate parameters describing protein structure, despite its high conservation, essential functional role and increasing availability. We present a structurally aware empirical substitution model for amino acid sequence evolution in which proteins are expressed using an expanded alphabet that relays both amino acid identity and structural information. Each character specifies an amino acid as well as information about the rotamer configuration of its side-chain: the discrete geometric pattern of permitted side-chain atomic positions, as defined by the dihedral angles between covalently linked atoms. By assigning rotamer states in 251,194 protein structures and identifying 4,508,390 substitutions between closely related sequences, we generate a 55-state “Dayhoff-like” model that shows that the evolutionary properties of amino acids depend strongly upon side-chain geometry. The model performs as well as or better than traditional 20-state models for divergence time estimation, tree inference, and ancestral state reconstruction. We conclude that not only is rotamer configuration a valuable source of information for phylogenetic studies, but that modeling the concomitant evolution of sequence and structure may have important implications for understanding protein folding and function.
Highlights
The development of evolutionary models is a prerequisite for many common bioinformatics tasks such as recognition of homologous sequences, phylogenetic tree estimation, evolutionary hypothesis testing and protein structure prediction (Huelsenbeck and Rannala 1997, Felsenstein 2004, Koonin 2005, Ginalski 2006).
By compiling a large set of homologous sequences for which structural data are available, we develop a structurally aware “Dayhoff-like” substitution model based on an instantaneous rate matrix that uses an expanded state set composed of 55 states, each of which corresponds to the combination of a residue and its χ1 configuration (Table 1).
We show that RAM55 can accurately reconstruct ancestral rotamer states from descendant protein sequences of known structure; it is able to reconstruct ancestral amino acid states as well as or better than traditional 20-state models.
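To make the expanded alphabet in the highlights above concrete, the following is a minimal sketch of how a residue and its χ1 dihedral angle might be combined into a single state label. Table 1 is not reproduced here, so everything specific in the sketch is an assumption for illustration: the per-residue state counts (three χ1 states for 17 residues, two for proline, one each for glycine and alanine, which happens to total 55), the 120° bin boundaries, the 180° split for proline, and the label format. The names `N_CHI1_STATES`, `chi1_bin` and `rotamer_state` are hypothetical and do not come from the paper.

```python
# ASSUMPTION: per-residue chi1 state counts chosen for illustration only.
# The actual assignment is given in the paper's Table 1, not shown here.
THREE_STATE_RESIDUES = "CDEFHIKLMNQRSTVWY"  # 17 residues
N_CHI1_STATES = {aa: 3 for aa in THREE_STATE_RESIDUES}
N_CHI1_STATES.update({"P": 2, "G": 1, "A": 1})


def chi1_bin(chi1_degrees):
    """Map a chi1 angle to one of three 120-degree bins (1, 2 or 3).

    ASSUMPTION: bins are centred near 60, 180 and 300 (= -60) degrees.
    """
    angle = chi1_degrees % 360.0
    if angle < 120.0:
        return 1
    if angle < 240.0:
        return 2
    return 3


def rotamer_state(residue, chi1_degrees=None):
    """Return a hypothetical expanded-alphabet label such as 'S3' or 'G'."""
    n_states = N_CHI1_STATES[residue]
    if n_states == 1:
        # No chi1 angle to speak of: the residue letter is the whole state.
        return residue
    if n_states == 2:
        # ASSUMPTION: two states split at 180 degrees.
        return f"{residue}{1 if chi1_degrees % 360.0 < 180.0 else 2}"
    return f"{residue}{chi1_bin(chi1_degrees)}"


# Encode a short toy peptide given (residue, chi1) pairs.
toy_peptide = [("S", -60.0), ("L", 180.0), ("G", None), ("P", 30.0)]
encoded = [rotamer_state(aa, chi1) for aa, chi1 in toy_peptide]
```

Under these assumptions, a sequence with known structure becomes a string over the larger alphabet, and substitutions can be counted between expanded states rather than between amino acids alone.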
Summary
The development of evolutionary models is a prerequisite (albeit sometimes an implicit one) for many common bioinformatics tasks such as recognition of homologous sequences, phylogenetic tree estimation, evolutionary hypothesis testing and protein structure prediction (Huelsenbeck and Rannala 1997, Felsenstein 2004, Koonin 2005, Ginalski 2006). Substitutions in amino acid sequences are usually described using a continuous-time Markov model with the 20 amino acids as the states of the chain (Liò et al 1998, Felsenstein 2004, Thorne and Goldman 2007, Perron et al in press). Models belonging to the empirical class are built by analysing large quantities of sequence data (typically hundreds of protein alignments) and estimating relative substitution rates between all state (amino acid) pairs under a time-reversible model. Empirical models are typically assumed to be applicable to broad classes of proteins with little further parameter optimization, aside from techniques that match amino acid frequencies to those observed in the specific dataset under study and that allow for rate heterogeneity amongst sequence sites (Yang 1993).
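The continuous-time, time-reversible Markov model described above can be illustrated with a small numerical sketch. It uses the standard construction in which the off-diagonal rates are the product of a symmetric exchangeability and the target state's equilibrium frequency, and obtains the transition probabilities P(t) = exp(Qt) through the symmetrised rate matrix. The three-state alphabet, the exchangeability values and the frequencies are made-up toy numbers, not estimates from the paper, and the function names `build_rate_matrix` and `transition_probabilities` are hypothetical.

```python
import numpy as np


def build_rate_matrix(exchangeabilities, freqs):
    """Build a time-reversible rate matrix Q with Q[i, j] = s[i, j] * pi[j].

    Diagonal entries make each row sum to zero, and Q is scaled so that the
    expected number of substitutions per unit time at equilibrium is 1.
    """
    s = np.asarray(exchangeabilities, dtype=float)
    pi = np.asarray(freqs, dtype=float)
    q = s * pi[np.newaxis, :]
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    mean_rate = -np.dot(pi, np.diag(q))
    return q / mean_rate


def transition_probabilities(q, freqs, t):
    """Compute P(t) = exp(Q t) for a reversible Q.

    Reversibility makes D^(1/2) Q D^(-1/2) symmetric (D = diag(pi)), so a
    symmetric eigendecomposition can be used.
    """
    sqrt_pi = np.sqrt(np.asarray(freqs, dtype=float))
    sym = (sqrt_pi[:, np.newaxis] * q) / sqrt_pi[np.newaxis, :]
    evals, evecs = np.linalg.eigh(sym)
    exp_sym = (evecs * np.exp(evals * t)) @ evecs.T
    return (exp_sym / sqrt_pi[:, np.newaxis]) * sqrt_pi[np.newaxis, :]


# Toy 3-state example (ASSUMPTION: illustrative values only).
toy_exch = np.array([[0.0, 1.0, 2.0],
                     [1.0, 0.0, 0.5],
                     [2.0, 0.5, 0.0]])
toy_freqs = np.array([0.5, 0.3, 0.2])

Q = build_rate_matrix(toy_exch, toy_freqs)
P = transition_probabilities(Q, toy_freqs, t=0.1)
```

The same construction applies unchanged to a 20-state or 55-state alphabet; only the dimensions of the exchangeability matrix and frequency vector differ. Matching the frequencies to a specific dataset, as mentioned above, would correspond to replacing `toy_freqs` with the observed state frequencies while keeping the exchangeabilities fixed.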