The aim of the present work is to develop quantitative structure–property relationship (QSPR) models for the adsorption capacity of a large data set of chemicals (n = 3483) onto activated carbon. Two splitting techniques, k-means clustering and principal component analysis (PCA) combined with the duplex method, were used to divide the data set into training and test sets in a 3:1 size ratio. An attempt was made to identify the descriptors common to the various models, since their recurrence indicates their importance for adsorption capacity onto activated carbon. Despite the large number of compounds in the training and test sets, no compound showing outlier behavior was omitted to artificially enhance the validation metrics, so the reported predictive quality of the models holds for diverse types of compounds. The models were developed to study the predictive ability of extended topochemical atom (ETA) parameters, which are calculated from the two-dimensional representation of molecules and were introduced by the present group of authors. The ETA models were compared with non-ETA models involving topological, spatial and structural descriptors. In all cases, the data set was first subjected to stepwise regression to identify the contributing variables, and the selected variables were then subjected to partial least squares (PLS) regression. The PLS models indicate that the ETA descriptors provide better external validation characteristics, in terms of predictive R², than the non-ETA ones. The best ETA model shows encouraging statistical quality (Q²int = 0.8059, Q²ext(F1) = 0.7914, Q²ext(F2) = 0.7909, Q²ext(F3) = 0.8492).
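
The abstract reports three external validation metrics (F1, F2, F3) without stating how they are computed. The following is a minimal sketch of the definitions commonly used in the QSPR literature, which is assumed here rather than taken from the abstract. The function name `external_q2` and the toy values are hypothetical and serve only as illustration; they are not the authors' code or data.

```python
def external_q2(y_train, y_test, y_pred):
    """Return the external Q2 metrics F1, F2 and F3 (commonly used definitions).

    y_train: observed responses of the training set
    y_test:  observed responses of the test set
    y_pred:  model predictions for the test set
    """
    n_tr = len(y_train)
    n_ext = len(y_test)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_test) / n_ext

    # Predictive residual sum of squares over the test set.
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))

    # F1: test-set deviations measured from the training-set mean.
    ss_ext_about_train = sum((y - mean_tr) ** 2 for y in y_test)
    # F2: test-set deviations measured from the test-set mean.
    ss_ext_about_test = sum((y - mean_ext) ** 2 for y in y_test)
    # F3: training-set deviations, with both terms scaled by set size.
    ss_train = sum((y - mean_tr) ** 2 for y in y_train)

    return {
        "F1": 1 - press / ss_ext_about_train,
        "F2": 1 - press / ss_ext_about_test,
        "F3": 1 - (press / n_ext) / (ss_train / n_tr),
    }


# Toy illustration with made-up numbers (not data from the study).
toy = external_q2([1, 2, 3, 4, 5], [2, 4], [2.5, 3.5])
```

Under these definitions F3 is normalized by the training-set variance, so it can differ noticeably from F1 and F2 when the test set spans a narrower or wider response range than the training set.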