In this work, we investigated three machine learning (ML) models, namely Gaussian process regression (GPR), LightGBM, and CatBoost, for predicting the solubility of CO2 in various ionic liquids (ILs). Three types of molecular descriptors, namely group contribution (GC), molecular structure descriptors (MSD), and hybrid GC-MSD, were used with each of the three models. The performance of the developed models was evaluated using mean absolute error (MAE), coefficient of determination (R2), and mean relative error (MRE, i.e., relative deviation in percentage), and each model was tested multiple times with different random-state values. The dataset was partitioned into training and testing sets at an 80:20 ratio, and additional split ratios were used to assess the sensitivity of prediction performance. Overall, all models predicted CO2 solubility in ILs well, with performance varying by descriptor type. Notably, the hybrid GC-MSD descriptor consistently outperformed the others, which we attribute to its incorporating a broader array of molecular feature information. In particular, the CatBoost-GC-MSD model performed best, achieving an R2 of 0.9925, an MAE of 0.0122, and an MRE of 11.1550%. A comparison with previous studies showed the superior performance of CatBoost-GC-MSD across all descriptor types. Furthermore, model interpretation using Shapley additive explanation (SHAP) analysis identified pressure, temperature, Chi0, Kappa2, and EState_VSA10 as the five most influential input features. These findings provide insight into the molecular features that affect CO2 solubility in ILs and lay a foundation for future research in this field.
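The evaluation protocol described above (three error metrics, an 80:20 split, and repeated runs with different random states) can be illustrated with a minimal sketch. The abstract does not give the metric formulas or the tooling used, so everything here is an assumption: MRE is taken as the mean of |y - ŷ| / |y| expressed in percent, R2 as one minus the ratio of residual to total sum of squares, and the function names (`mae`, `r2`, `mre_percent`, `split_indices`) are hypothetical helpers, not the authors' code.

```python
import random
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mre_percent(y_true, y_pred):
    """Mean relative error in percent (assumed definition; requires y_true != 0)."""
    return 100.0 * mean(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 with a fixed seed and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


if __name__ == "__main__":
    # Toy solubility values for illustration only; not data from the study.
    y_true = [0.1, 0.2, 0.4, 0.5]
    y_pred = [0.1, 0.25, 0.35, 0.5]
    print("MAE:", mae(y_true, y_pred))
    print("R2:", r2(y_true, y_pred))
    print("MRE (%):", mre_percent(y_true, y_pred))

    # Repeating the split with several seeds mimics testing with
    # different random-state values.
    for seed in range(3):
        train, test = split_indices(10, test_fraction=0.2, seed=seed)
        print(seed, len(train), len(test))
```

Because solubility values near zero inflate a relative error, a model can show a small MAE alongside a comparatively large MRE under this definition.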