Abstract

We present a collection of publicly available intrinsic aqueous solubility data of 829 drug‐like compounds. Four machine learning algorithms (random forests [RF], LightGBM, partial least squares, and least absolute shrinkage and selection operator [LASSO]), coupled with multistage permutation importance for feature selection and Bayesian hyperparameter optimization, were used to predict solubility from chemical structural information. Our results show that LASSO yielded the best predictive ability on an external test set, with a root mean square error RMSE(test) of 0.70 log points, an R²(test) of 0.80, and 105 features. When the number of descriptors is also taken into account, an RF model achieves the best balance between complexity and predictive ability, with an RMSE(test) of 0.72 log points, an R²(test) of 0.78, and only 17 features. On a more aggressive test set (principal component analysis [PCA]‐based split), the RF model generalized better. We propose a ranking score for choosing the best model, as test set performance is only one of the factors in creating an applicable model. The ranking score is a weighted combination of generalization, number of features, and test performance. From the two best learners, a consensus model was built that exhibits the best predictive ability and generalization, with an RMSE(test) of 0.67 log points and an R²(test) of 0.81.
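The two metrics quoted above, and the idea of a consensus of two learners, can be illustrated with a minimal sketch. The abstract does not state how the consensus predictions are combined, so the unweighted mean used here is an assumption, and the numbers are made-up placeholder values rather than data from the study.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error, in the same units as y (here: log units)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def consensus(pred_a, pred_b):
    """Unweighted mean of two models' predictions (assumed combination rule)."""
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]

# Made-up illustrative values, NOT data or results from the study.
y_test = [-2.0, -3.5, -4.0, -5.5]
pred_rf = [-2.4, -3.2, -4.5, -5.0]      # hypothetical RF predictions
pred_lasso = [-1.8, -3.9, -3.7, -5.9]   # hypothetical LASSO predictions
pred_consensus = consensus(pred_rf, pred_lasso)

for name, pred in [("RF", pred_rf), ("LASSO", pred_lasso), ("consensus", pred_consensus)]:
    print(f"{name}: RMSE={rmse(y_test, pred):.2f}, R2={r2(y_test, pred):.2f}")
```

In this toy example the two models err in opposite directions on each compound, so averaging cancels part of the error. That is the usual motivation for a consensus model, though the gain on real data depends on how correlated the individual errors are.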

Highlights

  • Solubility is a critical topic in pharmaceutical development as it can be a limiting factor to drug absorption.[1]

  • Our results show that LASSO yielded the best predictive ability on an external test set, with a root mean square error RMSE(test) of 0.70 log points, an R²(test) of 0.80, and 105 features.

  • Because the models use similar descriptors and random forests (RF) had a worse RMSE(test) than LASSO, RF might have been expected to extrapolate less well. This was not the case: RF performed better in this more challenging task.


Summary

| INTRODUCTION

Solubility is a critical topic in pharmaceutical development as it can be a limiting factor to drug absorption.[1] A comparison with previous studies is difficult because authors often assess model quality in different ways (train, test, cross-validation, out-of-fold) and report a multitude of model metrics.[31] For intrinsic solubility, literature values of the predictive performance of models on external test sets, expressed as RMSE, appear to vary between 0.7 and 1.05 log points[13,15,17,18,26,28,32] across a plethora of machine learning algorithms and datasets. Our goal in this work was to conduct a large-scale machine learning study to investigate how one can achieve robust predictions while retaining minimum model complexity. For this purpose, we curated a novel intrinsic solubility dataset from literature sources. We also present a more challenging test set to probe the models' extrapolation capabilities.

| MATERIALS AND METHODS
| RESULTS AND DISCUSSION
| CONCLUSIONS