Thermochemical Data Fusion Using Graph Representation Learning

Himaghna Bhattacharjee, Dionisios G Vlachos

doi:10.1021/acs.jcim.0c00699

Abstract

Large databases are required for "Big Data" applications in catalysis and materials science. Thermochemical databases can be created by combining data from various sources and by correcting low-fidelity data sets to higher accuracy with minimal computation. To achieve this "data fusion", thermochemical quantities of interest, calculated at various levels of density functional theory (DFT), need to be mapped to the same high level of theory. In this work, a graph-theoretical statistical framework is proposed for such tasks. Subgraph frequencies are shown to provide a natural representation for learning these fusion maps. The maps are linear and are learnt with automated descriptor selection. Trained on as little as ∼1% of the QM9 database of 133 885 molecules, these models can predict multiple thermochemical quantities at a higher level of theory with an accuracy of 1 kcal/mol. The method is explainable, generalizable, and provides a diagnostic tool for outlier identification.
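To make the idea of a linear fusion map over subgraph frequencies concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a molecule is given as a list of element symbols plus a list of bonds, counts only atoms, bonds, and three-atom paths as subgraphs, and writes the map as a low-fidelity value plus a weighted sum of counts. The function names `subgraph_frequencies` and `fuse`, the additive-correction form, and the weights are all hypothetical; the abstract does not specify which subgraphs are enumerated or how descriptors are selected.

```python
from collections import Counter


def subgraph_frequencies(atoms, bonds):
    """Count small labeled subgraphs in a molecular graph.

    atoms: list of element symbols, indexed by atom id.
    bonds: list of (i, j) index pairs, one per bond.
    Returns a Counter mapping a subgraph label to its frequency.
    """
    counts = Counter()
    # One-node subgraphs: each atom, labeled by its element.
    for symbol in atoms:
        counts[symbol] += 1
    # Two-node subgraphs: each bond, labeled by its sorted element pair.
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        counts["-".join(sorted((atoms[i], atoms[j])))] += 1
        neighbors[i].append(j)
        neighbors[j].append(i)
    # Three-node paths: each pair of distinct neighbors around a center atom.
    for center, nbrs in neighbors.items():
        for a in range(len(nbrs)):
            for b in range(a + 1, len(nbrs)):
                ends = sorted((atoms[nbrs[a]], atoms[nbrs[b]]))
                counts[f"{ends[0]}-{atoms[center]}-{ends[1]}"] += 1
    return counts


def fuse(low_fidelity_value, counts, weights, intercept=0.0):
    """Assumed linear map: low-fidelity value plus a weighted sum of subgraph counts."""
    correction = intercept + sum(
        w * counts.get(label, 0) for label, w in weights.items()
    )
    return low_fidelity_value + correction


# Ethanol heavy-atom skeleton C-C-O (hydrogens omitted for brevity).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
counts = subgraph_frequencies(atoms, bonds)

# Hypothetical weights (kcal/mol per subgraph occurrence), not fitted values.
# Subgraphs absent from this dict contribute nothing, which mimics the effect
# of descriptor selection dropping uninformative features.
weights = {"C-C": 0.5, "C-O": -0.25}
corrected = fuse(-100.0, counts, weights)
```

In practice the weights would be fitted on molecules for which both levels of theory are available, and a sparse regression would be one plausible way to obtain automated descriptor selection; that choice is an assumption here rather than something stated in the abstract.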

Full Text