The chemical properties of oils are vital in the design of microemulsion systems. The hydrophilic-lipophilic difference equation, which is used to predict the phase behavior of microemulsions, expresses an oil's physicochemical properties as the equivalent alkane carbon number (EACN). Determining the EACN experimentally requires knowledge of the temperature dependence of the microemulsion system and of the effects of different surfactant concentrations. The experiments are therefore time-intensive and tedious, requiring days to months for proper phase separation. They also require high-purity chemicals because microemulsions are sensitive to impurities. Our work focuses on quick and reliable prediction of the EACN with machine learning (ML) models. Because ML-based chemical prediction is still immature, we compare three graph neural networks (GNNs) and a gradient-boosted tree algorithm, XGBoost. The GNNs take molecular structures represented as simplified molecular-input line-entry system (SMILES) codes as their initial input, which allows us to assess whether geometry optimization is necessary for reliable results. The XGBoost model also starts from the SMILES representations of the molecules but uses molecular descriptors instead of geometry optimizations. The best model tested (a crystal graph convolutional neural network with the Merck molecular force field-94) has an error of 1.15 EACN units relative to the true EACN on unseen data, with the errors skewed toward zero, and an R² score of 0.9.
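As a reading aid for the reported figures, the following minimal sketch shows how an error in EACN units and an R² score can be computed from true and predicted values. The abstract does not state which error metric is used, so the sketch computes both the mean absolute error and the root-mean-square error as assumptions. The function name `regression_metrics` and the predicted values are hypothetical and purely illustrative; they are not results from this work. The true values are the carbon numbers of five linear alkanes, whose EACN equals the carbon number by definition.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired true/predicted EACN values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    residuals = [pred - true for true, pred in zip(y_true, y_pred)]

    # Mean absolute error and root-mean-square error, both in EACN units.
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)

    # Coefficient of determination: 1 - (residual SS / total SS).
    mean_true = sum(y_true) / n
    ss_tot = sum((true - mean_true) ** 2 for true in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    r2 = 1.0 - ss_res / ss_tot

    return {"mae": mae, "rmse": rmse, "r2": r2}


# Linear alkanes: hexane, octane, decane, dodecane, hexadecane.
eacn_true = [6.0, 8.0, 10.0, 12.0, 16.0]
# Hypothetical model outputs, made up for illustration only.
eacn_pred = [7.0, 8.0, 9.0, 12.0, 16.0]

metrics = regression_metrics(eacn_true, eacn_pred)
print(metrics)
```

Reporting an absolute error alongside R² is useful because the former is expressed directly in EACN units, while the latter indicates how much of the variance across oils the model explains.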