Abstract
We present machine learning models based on kernel-ridge regression for predicting X-ray photoelectron spectra of organic molecules, which originate from the ionization energies of 1s electrons in carbon (C), nitrogen (N), oxygen (O), and fluorine (F) atoms. We constructed the training dataset through high-throughput calculations of core-electron binding energies (CEBEs) for 12,880 small organic molecules in the bigQM7ω dataset, employing the Δ-SCF formalism coupled with meta-GGA-DFT and a variationally converged basis set. The models are cost-effective: they take as input the atomic coordinates of a molecule generated using universal force fields, while estimating the target-level CEBEs corresponding to the DFT-level equilibrium geometry. We explore transfer learning by using atomic environment feature vectors learned with a graph neural network framework as descriptors in kernel-ridge regression. Additionally, we enhance accuracy with the Δ-machine learning framework by leveraging inexpensive baseline spectra derived from Kohn–Sham eigenvalues. Upon application to 208 combinatorially substituted uracil molecules, which are larger than the molecules in the training set, our analyses reveal that although the models may not yield quantitatively accurate predictions of CEBEs on a molecule-by-molecule basis, they do exhibit a strong linear correlation, which proves valuable for virtual high-throughput screening. We provide the dataset and models as the Python module cebeconf to facilitate further exploration.
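The Δ-machine learning idea mentioned above can be illustrated with a minimal sketch: kernel-ridge regression is trained on the difference between target and baseline values, and the learned correction is added back to the baseline at prediction time. Everything below is an assumption made for illustration only: the data are synthetic, the descriptors are random numbers rather than atomic environment vectors, the Gaussian kernel and hyperparameters are arbitrary choices, and the function names (`gaussian_kernel`, `krr_fit`, `krr_predict`) are hypothetical and are not part of the cebeconf module.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def krr_fit(X, y, sigma, lam):
    """Solve (K + lam * I) alpha = y for the regression coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)


def krr_predict(X_train, alpha, X_new, sigma):
    """Predict as a kernel-weighted sum over the training points."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha


# Synthetic stand-ins (assumption): 2D descriptors, a cheap baseline
# energy, and a target that differs from it by a smooth correction.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
baseline = 290.0 + X[:, 0]
target = baseline + 0.5 * np.sin(X[:, 1])

X_train, X_test = X[:250], X[250:]
sigma, lam = 1.0, 1e-3  # arbitrary illustrative hyperparameters

# Delta-learning: fit the model on (target - baseline) only.
alpha = krr_fit(X_train, target[:250] - baseline[:250], sigma, lam)

# Prediction: baseline plus the learned correction.
correction = krr_predict(X_train, alpha, X_test, sigma)
delta_prediction = baseline[250:] + correction

mae_baseline = np.mean(np.abs(baseline[250:] - target[250:]))
mae_delta = np.mean(np.abs(delta_prediction - target[250:]))
print(f"MAE of baseline alone:     {mae_baseline:.4f}")
print(f"MAE of baseline + Δ-model: {mae_delta:.4f}")
```

Learning the correction rather than the full target is the design choice that matters here: the difference is typically smoother and smaller in magnitude than the target itself, so the regression has less to learn from the same number of training points.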