Abstract

In recent years, protein-ligand interaction scoring functions derived through machine learning have repeatedly been reported to outperform conventional scoring functions. However, several published studies have suggested that the superior performance of machine-learning scoring functions depends on the overlap between the training set and the test set. In order to examine the true power of machine-learning algorithms in scoring function formulation, we have conducted a systematic study of six off-the-shelf machine-learning algorithms: Bayesian Ridge Regression (BRR), Decision Tree (DT), K-Nearest Neighbors (KNN), Multilayer Perceptron (MLP), Linear Support Vector Regression (L-SVR), and Random Forest (RF). Scoring functions were derived with these machine-learning algorithms on various training sets selected from over 3,700 protein-ligand complexes in the PDBbind refined set (version 2016). All resulting scoring functions were then applied to the CASF-2016 test set to validate their scoring power. In our first series of trials, the size of the training set was fixed, while the overall similarity between the training set and the test set was varied systematically. In our second series of trials, the overall similarity between the training set and the test set was fixed, while the size of the training set was varied. Our results indicate that the performance of these machine-learning models depends, to varying degrees, on the contents and the size of the training set, with the RF model demonstrating the best learning capability. In contrast, the performance of three conventional scoring functions (i.e., ChemScore, ASP, and X-Score) is essentially insensitive to the use of different training sets. Therefore, one has to consider not only "hard overlap" but also "soft overlap" between the training set and the test set in order to evaluate machine-learning scoring functions.
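The evaluation protocol described above can be sketched in a few lines with scikit-learn, which provides all six off-the-shelf algorithms named in the study. The feature matrix and binding affinities below are synthetic stand-ins for real protein-ligand descriptors, and the assessment metric (the Pearson correlation between predicted and experimental affinities, i.e., CASF-style "scoring power") is the only part taken from the text; everything else is an illustrative assumption, not the authors' exact pipeline.

```python
# Hedged sketch: fit off-the-shelf regressors as scoring functions and
# measure scoring power as the Pearson R between predictions and targets.
# X and y are synthetic placeholders for protein-ligand descriptors and
# experimental binding affinities.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_train, n_test, n_feat = 300, 100, 20
X = rng.normal(size=(n_train + n_test, n_feat))
w = rng.normal(size=n_feat)
y = X @ w + 0.1 * rng.normal(size=n_train + n_test)  # synthetic affinities

X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

models = {
    "BRR": BayesianRidge(),
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Pearson correlation coefficient: the CASF-2016 scoring-power metric
    results[name] = float(np.corrcoef(pred, y_test)[0, 1])
    print(f"{name}: Pearson R = {results[name]:.3f}")
```

Varying which complexes enter `X_train` (by similarity to the test set) or how many rows it holds reproduces, in miniature, the two series of trials described above.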
In this spirit, we have compiled data sets based on the PDBbind refined set by removing redundant samples under several similarity thresholds. Scoring function developers are encouraged to employ them as standard training sets if they want to evaluate their new models on the CASF-2016 benchmark.
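The redundancy removal described above can be illustrated as follows. In practice one would compare ligand fingerprints (e.g., Tanimoto similarity) and protein sequence identity; here `similarity` is a hypothetical Tanimoto-style stand-in over binary feature tuples, and `filter_training_set` is an assumed helper name, not part of the published data sets.

```python
# Hedged sketch of controlling "soft overlap": drop any training sample
# whose similarity to some test sample reaches a chosen threshold.
def similarity(a, b):
    # Tanimoto-style similarity on binary fingerprints (illustrative only)
    on_a = {i for i, x in enumerate(a) if x}
    on_b = {i for i, x in enumerate(b) if x}
    union = len(on_a | on_b)
    return len(on_a & on_b) / union if union else 0.0

def filter_training_set(train, test, threshold):
    """Keep a training sample only if its maximum similarity to every
    test sample stays below the threshold."""
    return [t for t in train
            if max(similarity(t, s) for s in test) < threshold]

train = [(1, 1, 0, 0), (1, 0, 1, 0), (0, 0, 1, 1)]
test = [(1, 1, 0, 0)]
kept = filter_training_set(train, test, 0.8)
print(len(kept))  # the exact duplicate of the test sample is removed
```

Sweeping the threshold over several values yields a family of progressively less redundant training sets, mirroring the "several similarity thresholds" mentioned above.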
