Abstract

Unsupervised learning is becoming an essential tool for analyzing the increasingly large amounts of data produced by atomistic and molecular simulations in materials science, solid-state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the unsupervised learning methods most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representations of molecular systems and present state-of-the-art algorithms for dimensionality reduction, density estimation, clustering, and kinetic modeling. We divide our discussion into self-contained sections, each devoted to a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.

Highlights

  • In recent years, we have witnessed a substantial expansion in the amount of data generated by molecular simulation.

  • Throughout the Review, we present these techniques with an emphasis on their application to the analysis of molecular dynamics, and we discuss their advantages and disadvantages in this context.

  • Time-lagged independent component analysis (TICA) has been leveraged to analyze a variety of biomolecular systems from both simulation and experimental data, including the dynamics of protein folding,[252] disordered proteins,[317] protein−peptide and protein−protein association,[29,255] protein conformational change and ligand binding,[318] binding-induced folding,[256] and kinase functional dynamics.[257] TICA has also been integrated into enhanced sampling algorithms.[319,320]

Summary

INTRODUCTION

We have witnessed a substantial expansion in the amount of data generated by molecular simulation. A striking example is given by the kinetics of complex conformational changes in biomolecules, which, on long time scales, can be well described by transition rates between a few discrete states. Symmetries, such as the invariance of physical properties under translation, rotation, or permutation of equivalent particles, can be leveraged to obtain a more compact representation of simulation data. Kinetic models are qualitatively based on the requirement that a meaningful low-dimensional model should reproduce the relevant time-correlation properties of the original dynamics (e.g., the transition rates). Other valuable review articles of potential significance to the reader interested in machine learning for molecular and materials science are refs 5−9.
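As a small illustration of how such symmetries can be exploited, the sketch below builds a descriptor that does not change when a configuration is translated, rotated, or has its equivalent particles relabeled. It is only a toy example and is not taken from the Review: the function names (`sorted_pair_distances`, `rotate_z`, `translate`) and the four-particle configuration are hypothetical, and the sorted list of pairwise distances is chosen for simplicity. It assumes all particles are equivalent, and it is not a complete representation, since distinct configurations can share the same set of distances.

```python
import math
import random
from itertools import combinations


def sorted_pair_distances(coords):
    """Return the sorted list of all pairwise distances between particles.

    Distances are unchanged by rigid translations and rotations; sorting
    removes any dependence on the order in which particles are labeled.
    """
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))


def rotate_z(coords, angle):
    """Rotate every 3D point about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def translate(coords, shift):
    """Shift every point by the same displacement vector."""
    return [tuple(p + d for p, d in zip(point, shift)) for point in coords]


# Toy configuration of four identical particles (arbitrary units, hypothetical).
config = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.5, 0.5, 1.5)]

# Apply a rotation, then a translation, then a random relabeling of particles.
moved = translate(rotate_z(config, 0.7), (3.0, -1.0, 2.5))
rng = random.Random(0)
rng.shuffle(moved)

reference = sorted_pair_distances(config)
transformed = sorted_pair_distances(moved)

# The raw Cartesian coordinates differ, but the descriptor agrees
# up to floating-point rounding.
is_invariant = all(
    math.isclose(r, t, abs_tol=1e-9) for r, t in zip(reference, transformed)
)
print("descriptor invariant:", is_invariant)
```

For N particles the raw input has 3N coordinates that change under every rigid motion, whereas the descriptor depends only on the internal geometry. This is the sense in which symmetry-aware features give a more compact description of the data.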

FEATURE REPRESENTATION
Representations for Macromolecular Systems
Representations for Condensed Matter Systems
Representation Learning
DIMENSIONALITY REDUCTION AND MANIFOLD LEARNING
Linear Dimensionality Reduction Methods
Nonlinear Dimensionality Reduction
DENSITY ESTIMATION
Parametric Density Estimation
Nonparametric Density Estimation
CLUSTERING
Partitioning Schemes
Density-Based Clustering
KINETIC MODELS
Time-Lagged Independent Component Analysis
Variational Approach to Conformational Dynamics
Markov State Modeling
Koopman Models and VAMP
VAMPnets
Feature Representations
Dimensionality Reduction
Density Estimation
Clustering
CONCLUSION AND DISCUSSION