Abstract

Graph neural networks (GNNs) have been considered an attractive modelling method for molecular property prediction, and numerous studies have shown that GNNs can yield more promising results than traditional descriptor-based methods. In this study, based on 11 public datasets covering various property endpoints, the predictive capacity and computational efficiency of the prediction models developed by eight machine learning (ML) algorithms, including four descriptor-based models (SVM, XGBoost, RF and DNN) and four graph-based models (GCN, GAT, MPNN and Attentive FP), were extensively tested and compared. The results demonstrate that, on average, the descriptor-based models outperform the graph-based models in terms of both prediction accuracy and computational efficiency. SVM generally achieves the best predictions for the regression tasks. Both RF and XGBoost achieve reliable predictions for the classification tasks, and some of the graph-based models, such as Attentive FP and GCN, yield outstanding performance on a fraction of the larger or multi-task datasets. In terms of computational cost, XGBoost and RF are the two most efficient algorithms and need only a few seconds to train a model, even for a large dataset. For the descriptor-based models, interpretation with the SHAP method can effectively recover established domain knowledge. Finally, we explored the use of these models for virtual screening (VS) towards HIV and demonstrated that different ML algorithms offer diverse VS profiles. All in all, we believe that off-the-shelf descriptor-based models can still be directly employed to accurately predict various chemical endpoints with excellent computability and interpretability.

Highlights

  • Molecular property modelling, which assists in hunting for chemicals with desired pharmacological and ADME/T properties, is one of the most classical cheminformatics tasks [1, 2]

  • In descriptor-based deep learning (DL) models, the molecular descriptors and/or fingerprints commonly used in traditional quantitative structure–activity relationship (QSAR) models serve as the input, and a specific DL architecture is employed to train a model [25]

  • Three datasets (ESOL, FreeSolv, and Lipop) were used for the regression tasks, and the remaining eight datasets were used for the classification tasks; the 11 datasets can be further divided into single-task datasets (ESOL, FreeSolv, Lipop, human immunodeficiency virus (HIV), BACE, and BBBP) and multi-task datasets (ClinTox, SIDER, Tox21, ToxCast, and maximum unbiased validation (MUV))


Summary

Introduction

Molecular property modelling, which assists in hunting for chemicals with desired pharmacological and ADME/T (absorption, distribution, metabolism, excretion, and toxicity) properties, is one of the most classical cheminformatics tasks [1, 2]. In descriptor-based deep learning (DL) models, the molecular descriptors and/or fingerprints commonly used in traditional quantitative structure–activity relationship (QSAR) models serve as the input, and a specific DL architecture is employed to train a model [25]. A GNN, by contrast, learns a representation of each atom by recursively passing messages across the molecular graph (Fig. 1): each atom aggregates the information from its neighboring atoms, encoded by their atom feature vectors, together with the information from the connecting bonds, encoded by the bond feature vectors. The state of the central atom is then updated, and a read-out operation follows. The key feature of GNNs is their capacity to learn task-specific representations automatically through graph convolutions, without requiring traditional handcrafted descriptors and/or fingerprints.
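The message-passing idea described above can be illustrated with a minimal, dependency-free sketch. It is not the implementation of any model compared in this study: the function names (`message_passing_step`, `readout`, `embed_molecule`) are hypothetical, atom and bond feature vectors are assumed to have the same length, and plain sums stand in for the learned message, update, and read-out functions that real GNNs train.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def vec_add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def message_passing_step(
    atom_states: Dict[int, Vector],
    bonds: Dict[Tuple[int, int], Vector],
) -> Dict[int, Vector]:
    """One round of message passing on an undirected molecular graph.

    Each atom sums (neighbor state + bond feature) over its neighbors,
    then adds that aggregate to its own state. Real GNNs replace these
    sums with learned functions; this is only a stand-in.
    """
    dim = len(next(iter(atom_states.values())))
    messages = {atom: [0.0] * dim for atom in atom_states}
    for (i, j), bond_feat in bonds.items():
        # Bonds are undirected, so the message flows both ways.
        messages[i] = vec_add(messages[i], vec_add(atom_states[j], bond_feat))
        messages[j] = vec_add(messages[j], vec_add(atom_states[i], bond_feat))
    # State update of each central atom (placeholder for a learned update).
    return {atom: vec_add(atom_states[atom], messages[atom]) for atom in atom_states}


def readout(atom_states: Dict[int, Vector]) -> Vector:
    """Sum the atom states into one molecule-level vector."""
    dim = len(next(iter(atom_states.values())))
    total = [0.0] * dim
    for state in atom_states.values():
        total = vec_add(total, state)
    return total


def embed_molecule(
    atom_states: Dict[int, Vector],
    bonds: Dict[Tuple[int, int], Vector],
    rounds: int = 2,
) -> Vector:
    """Apply several message-passing rounds, then read out."""
    for _ in range(rounds):
        atom_states = message_passing_step(atom_states, bonds)
    return readout(atom_states)


if __name__ == "__main__":
    # Toy three-atom chain 0-1-2 with made-up two-dimensional features.
    atoms = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
    bonds = {(0, 1): [0.5, 0.5], (1, 2): [0.5, 0.5]}
    print(embed_molecule(atoms, bonds, rounds=1))
```

With one round on this toy chain, the central atom ends up with state `[3.0, 2.0]` and the read-out vector is `[7.0, 4.0]`. Each additional round lets information travel one bond further, which is why the aggregation is described as recursive.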

Materials and methods
AUC-ROC Binary labels of blood–brain barrier penetration
Results and discussion
Conclusion
