Abstract
Data science and machine learning in materials science require large datasets of technologically relevant molecules or materials. Currently, publicly available molecular datasets with realistic molecular geometries and spectral properties are rare. Here we supply a diverse benchmark spectroscopy dataset, denoted OE62, of 61,489 molecules extracted from organic crystals in the Cambridge Structural Database (CSD). Molecular equilibrium geometries are reported at the Perdew-Burke-Ernzerhof (PBE) level of density functional theory (DFT), including van der Waals corrections, for all 62 k molecules. For these geometries, OE62 supplies total energies and orbital eigenvalues at the PBE and PBE hybrid (PBE0) functional level of DFT for all 62 k molecules in vacuum, as well as at the PBE0 level for a subset of 30,876 molecules in (implicit) water. For 5,239 molecules in vacuum, the dataset provides quasiparticle energies computed with many-body perturbation theory in the G0W0 approximation with a PBE0 starting point. This subset is denoted GW5000 in analogy to the GW100 benchmark set (M. van Setten et al. J. Chem. Theory Comput. 12, 5076 (2016)).
Highlights
Background & Summary: Consistent and curated datasets have facilitated progress in the natural sciences.
The QM8 database offers optical spectra computed with time-dependent density functional theory (TDDFT) for 22 k organic molecules, while QM9, widely known as one of the standard benchmark sets for machine learning in chemistry, provides a variety of properties for 134 k organic molecules computed with density functional theory (DFT)[19,20], including energy levels for the highest occupied and the lowest unoccupied molecular orbitals (HOMO and LUMO, respectively).
We have based the spectroscopic dataset presented in this article on a diverse collection of 64,725 organic crystals that were extracted from the Cambridge Structural Database (CSD)[29] by Schober et al.[30,31].
Summary
Consistent and curated datasets have facilitated progress in the natural sciences. High-quality reference datasets were, for example, essential in the development of accurate computational methodology, in particular in quantum chemistry. We have based the spectroscopic dataset presented in this article on a diverse collection of 64,725 organic crystals that were extracted from the Cambridge Structural Database (CSD)[29] by Schober et al.[30,31]. In more detail, all molecules in OE62 are fully relaxed at the Perdew-Burke-Ernzerhof (PBE)[34] level of DFT, including Tkatchenko-Scheffler van der Waals (TS-vdW) corrections[35]. For these equilibrium structures, we report molecular orbital energies at the PBE and PBE hybrid (PBE0)[36,37] level; in the following, we refer to this part as the 62 k set.
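To illustrate how reported molecular orbital energies relate to the HOMO and LUMO levels mentioned above, the following is a minimal sketch, assuming a closed-shell molecule for which orbital eigenvalues and occupation numbers are available as plain lists. The function name `homo_lumo`, its parameters, and the numerical values are hypothetical and chosen for illustration only; they are not taken from the OE62 files or from any electronic-structure code.

```python
from typing import Sequence, Tuple


def homo_lumo(
    eigenvalues: Sequence[float],
    occupations: Sequence[float],
    occ_threshold: float = 0.5,
) -> Tuple[float, float, float]:
    """Return (HOMO, LUMO, gap) from orbital eigenvalues and occupations.

    Assumption: an orbital counts as occupied when its occupation number
    exceeds `occ_threshold`. Energies are returned in the input units.
    """
    if len(eigenvalues) != len(occupations):
        raise ValueError("eigenvalues and occupations must have equal length")

    occupied = [e for e, f in zip(eigenvalues, occupations) if f > occ_threshold]
    virtual = [e for e, f in zip(eigenvalues, occupations) if f <= occ_threshold]
    if not occupied or not virtual:
        raise ValueError("need at least one occupied and one unoccupied orbital")

    homo = max(occupied)  # highest occupied molecular orbital energy
    lumo = min(virtual)   # lowest unoccupied molecular orbital energy
    return homo, lumo, lumo - homo


# Illustrative (made-up) eigenvalues in eV with closed-shell occupations.
example_eigenvalues = [-10.0, -7.5, -6.0, -1.5, 0.5]
example_occupations = [2, 2, 2, 0, 0]
homo, lumo, gap = homo_lumo(example_eigenvalues, example_occupations)
print(f"HOMO = {homo} eV, LUMO = {lumo} eV, gap = {gap} eV")
```

The same function could be applied separately to eigenvalues from different levels of theory, for example PBE and PBE0, to compare the resulting gaps for one molecule.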