Abstract

Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant ‘patterns’ of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.

Highlights

  • Thanks to constant progress in DNA sequencing techniques, more than 4,400 full genomes have been sequenced [1], resulting in more than 3.6 × 10^7 known protein sequences [2], which are classified into more than 14,000 protein domain families [3], many of them containing in the range of 10^3 to 10^5 homologous amino-acid sequences

  • We show that the dimensional reduction, achieved by considering only the statistically most significant patterns, avoids overfitting in small sequence alignments and improves our capacity to extract residue contacts in that case

  • Principal component analysis can be applied to the L-dimensional Statistical Coupling Analysis (SCA) matrix C̃, and used to define so-called sectors, i.e. clusters of evolutionarily correlated sites. To bridge these two approaches, direct coupling analysis (DCA) and principal component analysis (PCA), we introduce the Hopfield-Potts model for the maximum likelihood modeling of the sequence distribution, given the residue frequencies f_i(a) and their pairwise correlations f_ij(a,b)


Summary

Introduction

Thanks to constant progress in DNA sequencing techniques, more than 4,400 full genomes have been sequenced [1], resulting in more than 3.6 × 10^7 known protein sequences [2], which are classified into more than 14,000 protein domain families [3], many of them containing in the range of 10^3 to 10^5 homologous (i.e. evolutionarily related) amino-acid sequences. A natural idea is to analyze covariations between residues, that is, whether their variations across sequences are correlated or not [7]. In this context, one introduces a matrix C_ij(a,b) of residue-residue correlations expressing how much the presence of amino acid ‘a’ at position ‘i’ of the protein is correlated across the sequence data with the presence of another amino acid ‘b’ at another position ‘j’. Extracting information from this matrix has been the subject of numerous studies over the past two decades. A full dynamical model for residue coevolution is still lacking.
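
To make the residue-residue correlation matrix concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual connected-correlation form, C_ij(a,b) = f_ij(a,b) - f_i(a) f_j(b), and it assumes the alignment is already encoded as integers between 0 and q-1. The function name `correlation_matrix` and the toy alignment are hypothetical, and the sketch omits the sequence reweighting and pseudocounts that are commonly applied to real alignments.

```python
import numpy as np


def correlation_matrix(msa, q):
    """Empirical frequencies and connected correlations of an alignment.

    msa : integer array of shape (M, L), entries in 0..q-1 (assumed encoding)
    q   : number of symbols (e.g. 21 for 20 amino acids plus a gap)
    Returns f_i with shape (L, q), and f_ij and C with shape (L, q, L, q).
    """
    msa = np.asarray(msa)
    M, L = msa.shape

    # One-hot encoding: X[m, i, a] = 1 if sequence m carries symbol a at site i.
    X = np.eye(q)[msa].reshape(M, L * q)

    f_i = X.mean(axis=0)             # single-site frequencies f_i(a)
    f_ij = X.T @ X / M               # pair frequencies f_ij(a,b)
    C = f_ij - np.outer(f_i, f_i)    # assumed form: f_ij(a,b) - f_i(a) f_j(b)

    shape = (L, q, L, q)
    return f_i.reshape(L, q), f_ij.reshape(shape), C.reshape(shape)


# Toy alignment (hypothetical): 4 sequences, 2 sites, 2 symbols,
# with the two sites perfectly anti-correlated.
toy_msa = [[0, 1], [0, 1], [1, 0], [1, 0]]
f_i, f_ij, C = correlation_matrix(toy_msa, q=2)
```

Under these assumptions, the eigenmodes and eigenvalues mentioned in the abstract could be obtained by flattening `C` to an (L·q) × (L·q) symmetric matrix and passing it to `np.linalg.eigh`.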

