Multi-channel GCN ensembled machine learning model for molecular aqueous solubility prediction on a clean dataset.

Yadong Chen, Haichun Liu, Yanmin Zhang, Yi Hua, Tao Lu, Li Liang, Guomeng Xing, Chenglong Deng

doi:10.1007/s11030-022-10465-x

Abstract

This study constructed a new aqueous solubility dataset and a solubility regression model that ensembles a graph convolutional neural network (GCN) with machine learning models. Aqueous solubility is a key physicochemical property of small molecules in drug discovery. Solubility prediction has been studied extensively over the past few decades. However, many of these studies report a high root mean squared error (RMSE), and the datasets they use often contain salt compounds and solubility data obtained under different experimental conditions. In this paper, we constructed a clean dataset of 2609 compounds, which is small but contains only solubility records for non-salt compounds measured at the same temperature (25 °C). We applied a GCN to construct an aqueous solubility prediction model. To enhance the performance of the model, molecular MACCS key fingerprints and physicochemical descriptors were combined with the GCN to build a multi-channel model. We also built two machine learning models (support vector regression and gradient boosting decision tree) and ensembled them with the GCN model, which improved the root mean squared error (RMSE = 0.665). Finally, comparative experiments showed that our framework achieved the best performance on the ESOL dataset (RMSE_val = 0.56, RMSE_test = 0.44) and surpassed four established software tools in predicting the aqueous solubility of new compounds.
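
As an illustration of the evaluation metric and the ensembling idea mentioned in the abstract, the following minimal Python sketch computes the RMSE and combines predictions from several models. The abstract does not state how the three models are combined, so the unweighted mean used here is an assumption, and the function names (`rmse`, `average_ensemble`) and all numeric values are hypothetical toy data, not results from the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def average_ensemble(*model_predictions):
    """Combine per-model predictions by an unweighted mean.

    Assumption: simple averaging. The paper may use a different
    combination scheme (for example weighting or stacking).
    """
    if not model_predictions:
        raise ValueError("at least one prediction list is required")
    n = len(model_predictions[0])
    if any(len(p) != n for p in model_predictions):
        raise ValueError("all prediction lists must have the same length")
    return [sum(values) / len(values) for values in zip(*model_predictions)]


# Toy log-solubility values, invented for illustration only.
y_true = [-2.0, -3.5, -1.0, -4.2]
gcn_pred = [-2.3, -3.1, -1.4, -4.0]   # stand-in for the multi-channel GCN
svr_pred = [-1.8, -3.9, -0.7, -4.6]   # stand-in for support vector regression
gbdt_pred = [-2.2, -3.4, -1.1, -4.3]  # stand-in for gradient boosting

ensemble_pred = average_ensemble(gcn_pred, svr_pred, gbdt_pred)

for name, pred in [("GCN", gcn_pred), ("SVR", svr_pred),
                   ("GBDT", gbdt_pred), ("Ensemble", ensemble_pred)]:
    print(f"{name}: RMSE = {rmse(y_true, pred):.3f}")
```

Averaging tends to help when the individual models make partly uncorrelated errors, but it is not guaranteed to beat the best single model on every dataset.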

Full Text