Abstract

Rationalizing the structure and structure–property relations for complex materials such as polymers or biomolecules relies heavily on the identification of local atomic motifs, e.g., hydrogen bonds and secondary structure patterns, that are seen as building blocks of more complex supramolecular and mesoscopic structures. Over the past few decades, several automated procedures have been developed to identify these motifs in proteins given the atomic structure. Being based on a very precise understanding of the specific interactions, these heuristic criteria formulate the question in a way that implies the answer, by defining a list of motifs based on those that are known to be naturally occurring. This makes them less likely to identify unexpected phenomena, such as the occurrence of recurrent motifs in disordered segments of proteins, and less suitable to be applied to different polymers whose structure is not driven by hydrogen bonds, or even to polypeptides when appearing in unusual, non-biological conditions. Here we discuss how unsupervised machine learning schemes can be used to recognize patterns based exclusively on the frequency with which different motifs occur, taking high-resolution structures from the Protein Data Bank as benchmarks. We first discuss the application of a density-based motif recognition scheme in combination with traditional representations of protein structure (namely, interatomic distances and backbone dihedrals). Then, we proceed one step further toward an entirely unbiased scheme by using as input a structural representation based on the atomic density and by employing supervised classification to objectively assess the role played by the representation in determining the nature of atomic-scale patterns.

Highlights

  • Macromolecules are characterized by their capability of folding and assembling into hierarchical structures, which is a crucial element in their activity and stability.

  • The analysis protocols discussed here identify the presence of significant motifs based exclusively on how often a given local atomistic environment occurs in a reference dataset. While this procedure makes it possible to rely on simple and rather generic descriptors of local structure, it still requires a dose of chemical intuition, i.e., it is necessary to know the basis of hydrogen bonding and that dihedral angles can be used to identify the secondary structure of a protein.

  • Given that the Smooth Overlap of Atomic Positions (SOAP) representation can be tuned to encompass environments of different sizes and provide a complete description of the correlation between atomic positions, it gives us an opportunity to verify whether any discrepancy between the Probabilistic Analysis of Molecular Motifs (PAMM) classification and the reference heuristics is due to the fact that the truncated representations that we use are incomplete, or due to the fact that the reference heuristics are not reflected in the probability distribution of motifs in the PDB.

Summary

INTRODUCTION

Macromolecules are characterized by their capability of folding and assembling into hierarchical structures, which is a crucial element in their activity and stability. Rosetta, one of the most well-known energy functions, has been developed to predict the structure of a protein given its amino acid sequence and local structural features such as dihedral angles (Simons et al., 1997, 1999). Another example where purely data-driven definitions would be advantageous is in secondary structure classification. While several methods exist to classify protein secondary structure (Kabsch and Sander, 1983; Frishman and Argos, 1995, 1996; Jones, 1999; Cuff and Barton, 2000; Andersen et al., 2002; Martin et al., 2005; Nagy and Oostenbrink, 2014; Haghighi et al., 2016), these methods rely on amino acid sequences, hydrogen bonding energies, geometrical criteria, or some combination thereof. By comparing the fidelity of the unsupervised classification given by PAMM with that of a supervised scheme, we can assess whether classification errors stem from an incomplete representation or are a manifestation of the arbitrary nature of heuristic methods.
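The backbone dihedral angles mentioned above as local structural features can be computed directly from atomic coordinates. The following is a minimal sketch, not the procedure used in the paper: it assumes each residue is given as a dictionary holding the Cartesian coordinates of its N, CA, and C atoms (a hypothetical layout chosen for illustration), and the function names `dihedral` and `backbone_dihedrals` are likewise illustrative. The angle φ is taken over the atoms C(i-1), N(i), CA(i), C(i) and ψ over N(i), CA(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points.

    The eclipsed (cis) arrangement gives 0 and the trans arrangement
    gives +/-180. The points must not be collinear.
    """
    b0 = _sub(p1, p0)
    b1 = _sub(p2, p1)  # central bond
    b2 = _sub(p3, p2)
    n1 = _cross(b0, b1)  # normal to the plane (p0, p1, p2)
    n2 = _cross(b1, b2)  # normal to the plane (p1, p2, p3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    return math.degrees(math.atan2(y, x))


def backbone_dihedrals(residues):
    """Return (phi, psi) in degrees for every interior residue.

    `residues` is an assumed input format: a list of dicts with keys
    "N", "CA", "C", each mapping to an (x, y, z) tuple. The first and
    last residues are skipped because phi or psi is undefined there.
    """
    angles = []
    for i in range(1, len(residues) - 1):
        prev, cur, nxt = residues[i - 1], residues[i], residues[i + 1]
        phi = dihedral(prev["C"], cur["N"], cur["CA"], cur["C"])
        psi = dihedral(cur["N"], cur["CA"], cur["C"], nxt["N"])
        angles.append((phi, psi))
    return angles
```

The resulting (φ, ψ) pairs are the kind of low-dimensional descriptor on which a heuristic or a density-based classification of secondary structure can then operate. Because the angles are periodic, any distance or density estimate built on them has to account for the wrap-around at ±180°.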

METHODS
Hydrogen Bond Definitions
Clustering Parameters
Dihedral Angles for Secondary
Clustering and Secondary Structure
Comparison of Secondary-Structure Definitions
Smooth Overlap of Atomic Positions
Brief Introduction to SOAP
Supervised Classification
Hydrogen Bonds
Dihedral Angles and Protein
SOAP Environments