QM40, Realistic Quantum Mechanical Dataset for Machine Learning in Molecular Science

Ayesh Madushanka, Renaldo T Moura, Elfi Kraka

doi:10.1038/s41597-024-04206-y

Journal: Scientific Data
Publication Date: Dec 18, 2024
License type: CC BY-NC-ND 4.0

Abstract

The growing adoption of machine learning (ML) and deep learning (DL) in scientific fields is hindered by the scarcity of high-quality datasets. While quantum mechanical (QM) predictions using DL techniques such as graph neural networks (GNNs) and generative models are gaining traction, insufficient training data remains a bottleneck. The QM40 dataset addresses this challenge by representing 88% of the FDA-approved drug chemical space. It includes molecules containing 10 to 40 atoms and composed of elements commonly found in drug molecular structures (C, O, N, S, F, Cl). The core QM40 main dataset contains 16 key quantum mechanical parameters for 162,954 molecules, calculated at the B3LYP/6-31G(2df,p) level of theory in Gaussian16 to ensure consistency with established datasets such as QM9 and Alchemy. This compatibility allows for future concatenation of QM40 with these datasets. QM40 also provides the initial and optimized Cartesian coordinates, Mulliken charges, and detailed bond information, including local vibrational mode force constants, which serve as indicators of bond strength. QM40 can be used to benchmark both existing and new ML and DL methods for predicting the results of QM calculations.
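
The composition criteria stated above (10 to 40 atoms; elements C, O, N, S, F, Cl) can be illustrated with a minimal sketch. This is not the authors' selection procedure. The function name `in_qm40_scope` is hypothetical, and two points are assumptions not stated in the abstract: that hydrogen is permitted alongside the listed elements, and that hydrogens count toward the atom total (controlled here by the `count_hydrogens` flag).

```python
from typing import Iterable

# Elements named in the abstract. Hydrogen is NOT listed there;
# allowing it below is an assumption made for this sketch.
LISTED_ELEMENTS = {"C", "O", "N", "S", "F", "Cl"}


def in_qm40_scope(
    symbols: Iterable[str],
    count_hydrogens: bool = True,  # assumption: H atoms count toward the size limit
    min_atoms: int = 10,
    max_atoms: int = 40,
) -> bool:
    """Return True if a molecule, given as element symbols, fits the stated criteria."""
    atoms = list(symbols)
    allowed = LISTED_ELEMENTS | {"H"}
    # Reject any molecule containing an element outside the allowed set.
    if any(atom not in allowed for atom in atoms):
        return False
    # Optionally ignore hydrogens when applying the size window.
    counted = atoms if count_hydrogens else [a for a in atoms if a != "H"]
    return min_atoms <= len(counted) <= max_atoms


# Hypothetical usage: benzene (C6H6) has 12 atoms in total, 6 of them heavy.
benzene = ["C"] * 6 + ["H"] * 6
benzene_all_atoms = in_qm40_scope(benzene)
benzene_heavy_only = in_qm40_scope(benzene, count_hydrogens=False)
```

Whether the size window counts all atoms or only heavy atoms changes which molecules qualify, so the flag should be set according to the dataset's own documentation before using a filter like this.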

Full Text