Abstract
Unsupervised learning is becoming an essential tool to analyze the increasingly large amounts of data produced by atomistic and molecular simulations in materials science, solid-state physics, biophysics, and biochemistry. In this Review, we provide a comprehensive overview of the methods of unsupervised learning that have been most commonly used to investigate simulation data and indicate likely directions for further developments in the field. In particular, we discuss feature representation of molecular systems and present state-of-the-art algorithms for dimensionality reduction, density estimation, and clustering, as well as kinetic models. We divide our discussion into self-contained sections, each discussing a specific method. In each section, we briefly touch upon the mathematical and algorithmic foundations of the method, highlight its strengths and limitations, and describe the specific ways in which it has been used, or can be used, to analyze molecular simulation data.
Highlights
In recent years, we have witnessed a substantial expansion in the amount of data generated by molecular simulation.
Throughout the review, we present these techniques, highlighting their specific application to the analysis of molecular dynamics and discussing their advantages and disadvantages in this context.
Time-lagged independent component analysis (TICA) has been leveraged to analyze a variety of biomolecular systems from both simulation and experimental data, including the dynamics of protein folding,[252] disordered proteins,[317] protein−peptide and protein−protein association,[29,255] protein conformational change and ligand binding,[318] binding-induced folding,[256] and kinase functional dynamics.[257] TICA has also been integrated into enhanced sampling algorithms.[319,320]
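TICA looks for linear combinations of input features whose autocorrelation at a chosen lag time is as large as possible. The sketch below is not a TICA implementation; it only computes that lagged autocorrelation for two synthetic one-dimensional signals, to illustrate why a slowly varying coordinate scores higher than a rapidly oscillating one. The function name `lagged_autocorrelation`, the signals, and the lag of 10 frames are assumptions made for illustration and do not come from the review.

```python
import math


def lagged_autocorrelation(x, lag):
    """Normalized autocorrelation of a 1D time series at a given lag (in frames)."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    n_pairs = len(x) - lag
    # Average product of mean-free values separated by `lag` frames.
    lagged_cov = sum(
        (x[t] - mean) * (x[t + lag] - mean) for t in range(n_pairs)
    ) / n_pairs
    return lagged_cov / variance


# Hypothetical signals: one slow mode (period 1000 frames), one fast mode (period 7 frames).
n_frames = 7000
lag = 10
slow = [math.sin(2 * math.pi * t / 1000) for t in range(n_frames)]
fast = [math.sin(2 * math.pi * t / 7) for t in range(n_frames)]

slow_ac = lagged_autocorrelation(slow, lag)
fast_ac = lagged_autocorrelation(fast, lag)
print(f"slow: {slow_ac:.3f}  fast: {fast_ac:.3f}")
```

A full TICA calculation would apply the same idea to all features at once, by solving a generalized eigenvalue problem built from the instantaneous and time-lagged covariance matrices.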
Summary
We have witnessed a substantial expansion in the amount of data generated by molecular simulation. A striking example is given by the kinetics of complex conformational changes in biomolecules, which, on long time scales, can be well described by transition rates between a few discrete states. Symmetries, such as the invariance of physical properties under translation, rotation, or permutation of equivalent particles, can be leveraged to obtain a more compact representation of simulation data. This set of approaches is qualitatively based on the requirement that a meaningful low-dimensional model should reproduce the relevant time-correlation properties of the original dynamics (e.g., the transition rates). Other valuable review articles of potential significance to the reader interested in machine learning for molecular and materials science are refs 5−9.
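As a small illustration of the symmetry argument above, the following sketch builds one simple invariant representation: the sorted list of interparticle distances. It assumes that all particles are equivalent, and the helper names (`sorted_pair_distances`, `rotate_z`, `translate`) and the coordinates are hypothetical. It is one possible descriptor, not the specific representation discussed in the review, and it does not uniquely determine a structure.

```python
import math
from itertools import combinations


def sorted_pair_distances(coords):
    """Sorted interparticle distances: unchanged by translation, rotation,
    and permutation of (assumed equivalent) particles."""
    return sorted(math.dist(a, b) for a, b in combinations(coords, 2))


def rotate_z(coords, angle):
    """Rotate all points about the z axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def translate(coords, shift):
    """Shift all points by the same displacement vector."""
    return [(x + shift[0], y + shift[1], z + shift[2]) for x, y, z in coords]


# Hypothetical four-particle configuration (arbitrary units).
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.5, 0.5, 1.5)]

# Apply a permutation, a rotation, and a translation to the same configuration.
permuted = [original[2], original[0], original[3], original[1]]
transformed = translate(rotate_z(permuted, 0.7), (3.0, -1.0, 2.5))

original_descriptor = sorted_pair_distances(original)
transformed_descriptor = sorted_pair_distances(transformed)
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(original_descriptor, transformed_descriptor)
)
print("descriptors match:", same)
```

The 12 Cartesian coordinates are reduced to 6 distances that carry no information about the absolute position, orientation, or particle labeling of the configuration.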