Abstract

We introduce QM7-X, a comprehensive dataset of 42 physicochemical properties for ≈4.2 million equilibrium and non-equilibrium structures of small organic molecules with up to seven non-hydrogen (C, N, O, S, Cl) atoms. To span this fundamentally important region of chemical compound space (CCS), QM7-X includes an exhaustive sampling of (meta-)stable equilibrium structures, comprising constitutional/structural isomers and stereoisomers, e.g., enantiomers and diastereomers (including cis-/trans- and conformational isomers), as well as 100 non-equilibrium structural variations of each, for a total of ≈4.2 million molecular structures. Computed at the tightly converged quantum-mechanical PBE0+MBD level of theory, QM7-X contains global (molecular) and local (atom-in-a-molecule) properties ranging from ground-state quantities (such as atomization energies and dipole moments) to response quantities (such as polarizability tensors and dispersion coefficients). By providing a systematic, extensive, and tightly converged dataset of quantum-mechanically computed physicochemical properties, we expect that QM7-X will play a critical role in the development of next-generation machine-learning-based models for exploring greater swaths of CCS and performing in silico design of molecules with targeted properties.

Highlights

  • A crucial aspect of drug discovery[1] and molecular materials design[2] is an extensive exploration and understanding of chemical compound space (CCS), the extremely high-dimensional space containing all feasible molecular compositions and conformations.

  • Quantum-mechanical (QM) calculations on small subsets of the GDB datasets have subsequently been used to generate meta-stable conformations for each molecular composition. This has led to seminal QM-based datasets like QM7[10,12,13] and QM9[11,14], which comprise a single meta-stable molecular structure per SMILES string with up to seven and nine heavy atoms, respectively.

  • To address these four challenges, we present QM7-X, which aims to provide a systematic, extensive, and tightly converged dataset of QM-based physical and chemical properties for a fundamentally important region of CCS covering small organic molecules.


Summary

Background & Summary

A crucial aspect of drug discovery[1] and molecular materials design[2] is an extensive exploration and understanding of chemical compound space (CCS), the extremely high-dimensional space containing all feasible molecular compositions and conformations. The second challenge is the steep computational cost of tightly converged QM calculations, which are critical for obtaining an accurate and reliable description of the structure and physicochemical properties of each molecule. To begin such an extensive exploration of CCS, the GDB datasets[1,10,11] have enumerated up to 166 billion organic molecules containing up to 17 heavy (non-hydrogen) atoms. We performed a systematic and exhaustive sampling of the (meta-)stable equilibrium structures of all molecules with up to seven heavy (C, N, O, S, Cl) atoms in the GDB13 database[10] using a density-functional tight-binding (DFTB) approach; this includes constitutional/structural isomers and stereoisomers, e.g., enantiomers and diastereomers (including cis-/trans- and conformational isomers). This was followed by the generation of 100 non-equilibrium structures (via DFTB normal-mode displacements of each equilibrium structure) for a total of ≈4.2 million molecular structures. We expect that QM7-X will be useful for the development of accurate and reliable ML-based techniques that will provide new insight into the complex structure–property relationships in molecules, and allow for more extensive exploration of CCS and the rational design of molecules with tailored properties.
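
The normal-mode displacement step can be illustrated with a minimal sketch. It is not the authors' procedure: the summary above does not say how the displacement amplitudes were chosen, so the Gaussian amplitudes, the `scale` value, and the function name `displace_along_modes` are all assumptions made for illustration. The sketch only shows the general idea of producing 100 distorted geometries from one equilibrium structure by adding random linear combinations of its normal modes.

```python
import numpy as np

def displace_along_modes(coords, modes, n_samples=100, scale=0.05, seed=0):
    """Return n_samples structures displaced from coords along normal modes.

    coords: (n_atoms, 3) equilibrium Cartesian coordinates
    modes:  (n_modes, n_atoms, 3) normal-mode displacement vectors
    scale:  hypothetical standard deviation of the random mode amplitudes
            (an assumption; the source does not specify the sampling scheme)
    """
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, dtype=float)
    modes = np.asarray(modes, dtype=float)
    # One random amplitude per (sample, mode) pair.
    amplitudes = rng.normal(0.0, scale, size=(n_samples, modes.shape[0]))
    # Linear combination of modes for each sample: (n_samples, n_atoms, 3).
    displacements = np.einsum("sm,mij->sij", amplitudes, modes)
    return coords[None, :, :] + displacements

# Toy usage: a diatomic along z with a single stretching mode.
coords = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.1]])
modes = np.array([[[0.0, 0.0, -1.0], [0.0, 0.0, 1.0]]]) / np.sqrt(2.0)
structures = displace_along_modes(coords, modes)
print(structures.shape)
```

In this toy case every displaced structure differs from the equilibrium geometry only in its bond length, since the single mode moves the two atoms in opposite directions along z.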

