Abstract

Structure-based drug design depends on detailed knowledge of the three-dimensional (3D) structures of protein–ligand binding complexes, but accurate prediction of ligand-binding poses remains a major challenge for molecular docking, owing to deficiencies in scoring functions (SFs) and the neglect of protein flexibility upon ligand binding. In this study, based on a cross-docking dataset purpose-built from the PDBbind database, we developed several XGBoost-trained classifiers to discriminate near-native binding poses from decoys, and systematically assessed their performance with and without the inclusion of cross-docked poses in the training/test sets. The results show that using Extended Connectivity Interaction Features (ECIF), Vina energy terms, and docking pose ranks as features achieves the best performance, as validated through random splitting or refined-core splitting and tested on re-docked or cross-docked poses. In addition, although performance decreases significantly under threefold clustered cross-validation, including the Vina energy terms effectively raises the lower bound of model performance and thus improves generalization capability. Our results also highlight the importance of incorporating cross-docked poses into the training of SFs intended to have a wide application domain and high robustness for binding pose prediction. The source code and the newly developed cross-docking datasets are freely available at https://github.com/sc8668/ml_pose_prediction and https://zenodo.org/record/5525936, respectively, under an open-source license. We believe that our study may provide valuable guidance for the development and assessment of new machine learning-based SFs (MLSFs) for the prediction of protein–ligand binding poses.
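As a minimal sketch of the pose-classification setup described above (not the authors' exact pipeline): each docked pose is featurized with ECIF-like interaction counts, Vina energy terms, and its docking rank, then labeled near-native or decoy by RMSD to the crystal pose. The 2.0 Å cutoff, feature dimensions, and use of scikit-learn's gradient boosting as a stand-in for XGBoost are all illustrative assumptions, shown here on synthetic data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in features for 1000 docked poses:
# ECIF-like interaction-type counts, Vina-like energy terms, and pose rank.
n_poses = 1000
ecif = rng.poisson(3.0, size=(n_poses, 10))       # hypothetical interaction counts
vina = rng.normal(-7.0, 2.0, size=(n_poses, 6))   # hypothetical energy terms
rank = rng.integers(1, 21, size=(n_poses, 1))     # pose rank from the docking run
X = np.hstack([ecif, vina, rank]).astype(float)

# Label a pose near-native if its heavy-atom RMSD to the crystal pose
# is <= 2.0 A (a common convention; the paper's exact cutoff may differ).
rmsd = rng.uniform(0.0, 10.0, size=n_poses)
y = (rmsd <= 2.0).astype(int)

# Random splitting, as in the simplest evaluation scheme described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Probability of being near-native, usable to re-rank docked poses.
prob_near_native = clf.predict_proba(X_te)[:, 1]
```

In practice the features would be computed from the actual protein–ligand complex, and splitting by protein cluster (rather than at random) gives the more stringent generalization test discussed in the abstract.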

Highlights

  • Molecular docking is one of the core technologies in structure-based drug design (SBDD), and it has contributed enormously to drug discovery and development in the past decades [1,2,3]

  • The performance of our machine learning-based scoring functions (MLSFs) was first evaluated by random splitting of the re-docked poses in the PDBbind-ReDocked dataset

  • When both the training and test sets contain re-docked poses, our MLSFs clearly outperform the classical methods, whether validated by random splitting or refined-core splitting, and even when tested on a dataset whose poses cover a broad RMSD range (i.e., the Comparative Assessment of Scoring Functions (CASF)-docking set)


Introduction

Molecular docking is one of the core technologies in structure-based drug design (SBDD), and it has contributed enormously to drug discovery and development in the past decades [1,2,3]. Molecular docking has two stages: (1) sampling the pose of the ligand in the binding site of a macromolecular target (usually a protein) and (2) scoring the binding strength of the ligand to the target with a predefined scoring function (SF). Depending on their theoretical principles, existing SFs can typically be divided into four main groups: physics-based, empirical, knowledge-based, and newly emerged machine learning-based SFs (MLSFs) [6]. The former three can be collectively referred to as classical SFs, since all of them follow an additive formulated hypothesis to represent the relationship between the features that characterize protein–ligand interactions and experimental bioactivities. During the past few years, extensive efforts have been made towards the development of MLSFs, ranging from traditional ML-based approaches (e.g., RF-Score [7,8,9], NNScore [10, 11], MIEC-SVM [12], ΔVinaRF20 [13], and AGL-Score [14]) to recently-emerged deep learning (DL)-based methods (e.g., AtomNet [15], DeepVS [16], KDEEP [17], and PotentialNet [18]), and most of these MLSFs demonstrate remarkably superior performance compared with the classical SFs [19,20,21,22].
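The additive hypothesis behind classical SFs can be illustrated with a toy sketch: the score is a weighted sum of precomputed interaction terms. The term names and weights below are invented for illustration and do not correspond to any published SF.

```python
def empirical_score(terms, weights):
    """Toy additive empirical scoring function: a weighted sum of
    precomputed protein-ligand interaction terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical term values for one protein-ligand pose.
terms = {"hbond": 3.0, "hydrophobic": 12.0, "rot_bonds": 5.0, "clash": 0.4}

# Hypothetical fitted weights (negative = favorable contribution).
weights = {"hbond": -0.5, "hydrophobic": -0.1, "rot_bonds": 0.3, "clash": 2.0}

score = empirical_score(terms, weights)  # more negative = predicted tighter binding
```

MLSFs drop this fixed additive form and instead let a learned model capture non-additive relationships between the same kinds of features and the binding outcome.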

