Improved Chemical Prediction from Scarce Data Sets via Latent Space Enrichment.

Nicolae C Iovanac,Brett M Savoie

doi:10.1021/acs.jpca.9b01398

Abstract

Modern machine learning provides promising methods for accelerating the discovery and characterization of novel chemical species. However, in many areas experimental data remain costly and scarce, and computational models are unavailable for targeted figures of merit. Here we report a promising pathway to address this challenge by using chemical latent space enrichment, whereby disparate data sources are combined in joint prediction tasks to enable improved prediction in data-scarce applications. The approach is demonstrated for p Ka prediction of moderately sized molecular species using a combination of experimentally available p Ka data and density functional theory-based characterizations of the (de)protonation free energy. A novel autoencoder framework is used to create a continuous chemical latent space that is then used in single and joint training tasks for property prediction. By combining these two data sets in a jointly trained autoencoder framework, we observe mutual improvement in property prediction tasks in the scarce data limit. We also demonstrate an enrichment mechanism that is unique to latent space training, whereby training on excess computational data can mitigate the prediction losses associated with scarce experimental data and advantageously organize the latent space. These results demonstrate that disparate chemical data sources can be advantageously combined in an autoencoder framework with potential general application to data-scarce chemical learning tasks.

Full Text