Machine learning (ML) models were developed in this work to predict self-diffusion coefficients (D11) of dense fluids using four algorithms: Gradient Boosting, k-Nearest Neighbors, Decision Tree, and Random Forest. A database of 7931 experimental points for 223 substances, covering different pressures and temperatures, was used to train the models. From an initial set of 34 input features (variables/properties), the eight most important, in order of decreasing relevance, were: density, acentric factor, temperature, critical temperature, critical volume, number of NH and/or OH bonds, pressure, and number of rotatable bonds. The best performance was achieved with the Gradient Boosting algorithm by the models built on the first 5 and the first 8 of these features (ML5-D11 and ML8-D11), whose average absolute relative deviations (AARDglobal) for the test set were 9.06 % and 7.14 %, respectively. The ML5-D11 and ML8-D11 models were compared with four phenomenological models (the predictive Zhu et al. equation, the 2-parameter Dymond-Hildebrand-Batschinski correlation, and the 1-parameter and 4-parameter Lennard-Jones correlations, LJ1 and LJ4), which showed AARDglobal of 104.05 %, 82.94 %, 16.92 % and 7.97 %, respectively, for the same test set. Although the LJ4 equation performs well, it embodies four substance-specific parameters that must be fitted to experimental data in advance. In contrast, the new ML5-D11 and ML8-D11 models are purely predictive and can be applied to polar/nonpolar, spherical/non-spherical, and even hydrogen-bonding molecules in liquids, compressed gases or supercritical fluids. The ML5-D11 and ML8-D11 models are provided for use as a Python program.
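For readers unfamiliar with the error metric quoted above, the following is a minimal sketch of how an average absolute relative deviation can be computed. It assumes the usual definition, AARD = (100/N) Σ |D11,calc − D11,exp| / D11,exp, taken over all N points of a data set. The abstract does not state the exact definition of AARDglobal, so this is an assumption. The function name `aard` and the numerical values are hypothetical and are not taken from the published Python program.

```python
def aard(calculated, experimental):
    """Average absolute relative deviation, in percent.

    Assumed definition: 100/N * sum(|calc - exp| / |exp|) over all N points.
    """
    if len(calculated) != len(experimental) or len(experimental) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        abs(calc - exp) / abs(exp)
        for calc, exp in zip(calculated, experimental)
    )
    return 100.0 * total / len(experimental)


# Hypothetical self-diffusion coefficients (arbitrary units), for illustration only.
d11_exp = [2.0, 4.0, 5.0]
d11_calc = [2.2, 3.6, 5.0]

print(f"AARD = {aard(d11_calc, d11_exp):.2f} %")
```

Because each deviation is divided by the experimental value, the metric weights small and large diffusion coefficients equally, which matters when the data span several orders of magnitude.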