Abstract

BackgroundAccurately predicting the binding affinities of large sets of protein-ligand complexes is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Since a scoring function (SF) is used to score, rank, and identify drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein's binding site has a significant bearing on the accuracy of virtual screening. Despite intense efforts in developing conventional SFs, which are either force-field based, knowledge-based, or empirical, their limited predictive power has been a major roadblock toward cost-effective drug discovery. Therefore, in this work, we present novel SFs employing a large ensemble of neural networks (NN) in conjunction with a diverse set of physicochemical and geometrical features characterizing protein-ligand complexes to predict binding affinity.ResultsWe assess the scoring accuracies of two new ensemble NN SFs based on bagging (BgN-Score) and boosting (BsN-Score), as well as those of conventional SFs in the context of the 2007 PDBbind benchmark that encompasses a diverse set of high-quality protein families. We find that BgN-Score and BsN-Score have more than 25% better Pearson's correlation coefficient (0.804 and 0.816 vs. 0.644) between predicted and measured binding affinities compared to that achieved by a state-of-the-art conventional SF. In addition, these ensemble NN SFs are also at least 19% more accurate (0.804 and 0.816 vs. 0.675) than SFs based on a single neural network that has been traditionally used in drug discovery applications. We further find that ensemble models based on NNs surpass SFs based on the decision-tree ensemble technique Random Forests.ConclusionsEnsemble neural networks SFs, BgN-Score and BsN-Score, are the most accurate in predicting binding affinity of protein-ligand complexes among the considered SFs. Moreover, their accuracies are even higher when they are used to predict binding affinities of protein-ligand complexes that are related to their training sets.

Highlights

  • Predicting the binding affinities of large sets of protein-ligand complexes is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology

  • This involves docking tens of thousands to millions of ligand candidates into a target protein receptor’s binding site and using a suitable scoring function (SF) to evaluate the binding affinity of each candidate to identify the top candidates as drug leads, and to perform lead optimization [2]; it is used for target identification [4]

  • We recently proposed random forests (RF), boosted regression trees (BRT), support vector machines (SVM), k-nearest neighbors, and multivariate adaptive regression splines (MARS) nonlinear scoring functions and compared their ligand scoring and ranking performances against the sixteen conventional SFs considered by Cheng et al on the same benchmark test sets [16,17]

Read more

Summary

Introduction

Predicting the binding affinities of large sets of protein-ligand complexes is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Due to prohibitive costs and delays associated with experimental drug discovery, pharmaceutical and biotechnology companies rely on virtual screening using computational molecular docking [1,2,3]. This involves docking tens of thousands to millions of ligand candidates into a target protein receptor’s binding site and using a suitable scoring function (SF) to evaluate the binding affinity of each candidate to identify the top candidates as drug leads, and to perform lead optimization [2]; it is used for target identification [4]. It has become attractive because of the ever-increasing number of available receptor protein structures and putative ligand drug candidates in publicly-accessible databases, such as the Protein Data Bank (PDB) [8], PDBbind [9], Cambridge Structural Database (CSD) [10], and corporate repositories

Objectives
Methods
Results
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call