Automated identification of isofragmented reactions and application in correcting molecular property models

Aidan O'Donnell, Bowen Li, Srinivas Rangarajan, Chrysanthos E. Gounaris

doi:10.1016/j.ces.2023.119411

Abstract

Machine learning techniques are increasingly being employed to predict molecular properties. Such models are often trained on large, computationally derived datasets and are only as accurate as the underlying data. We exploit the well-known error-cancelling effect of isodesmic and homodesmotic reactions to develop a multi-fidelity, data-driven method for molecular property prediction. First, we propose an optimization-based scheme to quickly and automatically identify all isofragmented reactions for a target molecule u, i.e., balanced reactions that involve u and one or more molecules from a given set of molecules (M) and that conserve a predefined set of fragments of arbitrary size. Second, we show that such isofragmented reactions can be leveraged to improve the predictive accuracy of a data-driven model by infusing a small, high-accuracy dataset comprising molecules in M. We applied this method using a high-accuracy subset of the NIST thermochemistry database and a simple additive data-driven model trained on a QM9 subset. Our results show that heats of formation predicted with our method were, on average, ∼4.4 kcal/mol more accurate than those predicted by the data-driven model alone.
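The two ideas in the abstract can be illustrated with a minimal sketch. It does not reproduce the optimization-based search described above; it only checks whether a given candidate reaction conserves fragments, and then applies the usual error-cancellation correction, H_corrected(u) = H_model(u) + Σ ν_m [H_ref(m) − H_model(m)], which is assumed here to be the form of the correction. The function names, the bond-type fragments, and all enthalpy values are hypothetical placeholders chosen for illustration, not data from the paper.

```python
from typing import Dict

Fragments = Dict[str, int]  # fragment label -> number of occurrences


def is_isofragmented(target: Fragments,
                     coeffs: Dict[str, float],
                     library: Dict[str, Fragments]) -> bool:
    """Check that target == sum_m coeffs[m] * library[m] for every fragment type."""
    kinds = set(target)
    for name in coeffs:
        kinds |= set(library[name])
    for kind in kinds:
        total = sum(nu * library[name].get(kind, 0) for name, nu in coeffs.items())
        if abs(total - target.get(kind, 0)) > 1e-9:
            return False
    return True


def corrected_property(h_model_target: float,
                       coeffs: Dict[str, float],
                       h_model: Dict[str, float],
                       h_ref: Dict[str, float]) -> float:
    """Shift the model prediction for the target by the model's known errors
    on the reference molecules, weighted by the reaction coefficients."""
    return h_model_target + sum(nu * (h_ref[m] - h_model[m])
                                for m, nu in coeffs.items())


# Hypothetical fragment definition: counts of C-C and C-H bonds.
library = {
    "ethane": {"C-C": 1, "C-H": 6},
    "methane": {"C-H": 4},
}
propane = {"C-C": 2, "C-H": 8}

# Candidate reaction: propane = 2 ethane - 1 methane
# (equivalently, propane + methane -> 2 ethane).
coeffs = {"ethane": 2.0, "methane": -1.0}
balanced = is_isofragmented(propane, coeffs, library)

# Placeholder enthalpies in kcal/mol (made up for this sketch).
h_model = {"ethane": -18.5, "methane": -17.0}   # low-fidelity model
h_ref = {"ethane": -20.0, "methane": -18.0}     # high-accuracy set M
h_model_propane = -22.0

corrected = corrected_property(h_model_propane, coeffs, h_model, h_ref)
print(balanced, corrected)  # True -24.0 with these placeholder numbers
```

Because the reaction conserves every fragment type, systematic per-fragment errors of the low-fidelity model appear on both sides and cancel, which is why only the small high-accuracy set M is needed to correct the prediction for u.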

Full Text