Abstract

Background: Molecular docking is a widely employed method in structure-based drug design. An essential component of molecular docking programs is a scoring function (SF) that can be used to identify the most stable binding pose of a ligand, when bound to a receptor protein, from among a large set of candidate poses. Despite intense efforts in developing conventional SFs, which are either force-field based, knowledge-based, or empirical, their limited docking power (that is, their ability to identify the correct pose) has been a major impediment to cost-effective drug discovery. In this work, we therefore explore a range of novel SFs that combine different machine-learning (ML) approaches with physicochemical and geometrical features characterizing protein-ligand complexes to predict the native or near-native pose of a ligand docked to a receptor protein's binding site. We assess the docking accuracies of these new ML SFs and of conventional SFs on the 2007 PDBbind benchmark dataset, using both diverse and homogeneous (protein-family-specific) test sets. Further, we perform a systematic analysis of the performance of the proposed SFs in identifying native poses of ligands that are docked to novel protein targets.

Results and conclusion: We find that the best-performing ML SF has a success rate of 80% in identifying poses that are within 1 Å root-mean-square deviation (RMSD) from the native poses of 65 different protein families. This compares with a success rate of only 70% achieved by the best conventional SF, ASP, employed in the commercial docking software GOLD. In addition, the proposed ML SFs perform better on novel proteins that were absent from their training data. We also observe steady gains in the performance of these scoring functions as the training set size and number of features are increased by considering more protein-ligand complexes and/or more computationally generated poses for each complex.
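The success criterion quoted above (a top-ranked pose within 1 Å RMSD of the native pose) can be illustrated with a minimal sketch. It assumes that both poses list the same atoms in the same order and share the receptor's coordinate frame, so no superposition is applied. The function names `pose_rmsd` and `success_rate` are hypothetical and are not taken from the paper.

```python
import math


def pose_rmsd(pose, native):
    """Return the RMSD (in Å) between two poses.

    Each pose is a list of (x, y, z) atom coordinates. Assumption: identical
    atom ordering and a shared receptor frame, so no alignment is performed.
    """
    if len(pose) != len(native) or not pose:
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (a - b) ** 2
        for atom_p, atom_n in zip(pose, native)
        for a, b in zip(atom_p, atom_n)
    )
    return math.sqrt(squared / len(pose))


def success_rate(top_pose_rmsds, cutoff=1.0):
    """Fraction of complexes whose top-ranked pose lies within `cutoff` Å."""
    if not top_pose_rmsds:
        raise ValueError("need at least one complex")
    hits = sum(1 for rmsd in top_pose_rmsds if rmsd <= cutoff)
    return hits / len(top_pose_rmsds)
```

Here `top_pose_rmsds` would hold, for each test complex, the true RMSD of whichever pose the scoring function ranked first.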

Highlights

  • Bringing a new drug to market is a complex process that costs hundreds of millions of dollars and spans over ten years of research, development, and testing

  • We report only the version and/or option that yields the best performance on the PDBbind benchmark considered by Cheng et al. For the machine learning methods, we use a total of six regression techniques in our study: multiple linear regression (MLR), multivariate adaptive regression splines (MARS), k-nearest neighbors, support vector machines (SVM), random forests (RF), and boosted regression trees (BRT) [26]

  • We found that ML models trained to explicitly predict root-mean-square deviation (RMSD) values significantly outperform all conventional scoring functions (SFs) in almost all testing scenarios
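As a rough illustration of an ML scoring function that predicts RMSD directly, the sketch below uses a plain k-nearest-neighbors regressor, one of the six techniques listed above. It is a toy version under stated assumptions: each pose is already described by a numeric feature vector, distances are Euclidean, and the names `knn_predict_rmsd` and `pick_pose` are hypothetical. It does not reproduce the authors' features or model settings.

```python
import math


def knn_predict_rmsd(train_features, train_rmsds, query, k=3):
    """Predict RMSD for `query` as the mean label of its k nearest training poses."""
    if not train_features or len(train_features) != len(train_rmsds):
        raise ValueError("training features and labels must align and be non-empty")
    # Sort training poses by Euclidean distance to the query feature vector.
    by_distance = sorted(
        (math.dist(row, query), rmsd)
        for row, rmsd in zip(train_features, train_rmsds)
    )
    nearest = by_distance[:k]
    return sum(rmsd for _, rmsd in nearest) / len(nearest)


def pick_pose(train_features, train_rmsds, candidate_features, k=3):
    """Return the index of the candidate pose with the lowest predicted RMSD."""
    predictions = [
        knn_predict_rmsd(train_features, train_rmsds, features, k)
        for features in candidate_features
    ]
    return min(range(len(predictions)), key=predictions.__getitem__)
```

The pose returned by `pick_pose` is the one the model believes to be closest to the native pose; comparing its true RMSD with a cutoff gives the per-complex success used in docking-power evaluations.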


Summary

Introduction

Background: Bringing a new drug to market is a complex process that costs hundreds of millions of dollars and spans over ten years of research, development, and testing. The most popular approach to predicting the correct binding pose and binding affinity (BA) in virtual screening is structure-based, in which physicochemical interactions between a ligand and receptor are deduced from the 3D structures of both molecules. This docking and scoring step is performed iteratively over a database containing thousands to millions of ligand candidates. After their binding poses have been predicted, another scoring round is performed to rank the ligands according to their predicted binding free energies. We perform a systematic analysis of the performance of the proposed SFs in identifying native poses of ligands that are docked to novel protein targets.
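The two scoring rounds described above can be sketched as a small pipeline: first choose the best pose for each ligand, then rank ligands by predicted binding free energy. Everything here is an illustrative assumption: the function `screen`, the callables `pose_score` (lower is better) and `binding_energy` (more negative means stronger predicted binding), and the data layout are hypothetical and stand in for whatever docking program and scoring functions are actually used.

```python
def screen(ligand_poses, pose_score, binding_energy):
    """Rank ligands in two rounds.

    ligand_poses   : dict mapping ligand name -> list of candidate poses
    pose_score     : callable(pose) -> float, lower means a better pose
    binding_energy : callable(pose) -> float, predicted binding free energy
    Returns a list of (ligand, best_pose, energy), strongest binder first.
    """
    ranked = []
    for name, poses in ligand_poses.items():
        if not poses:
            continue  # nothing was docked for this ligand
        # Round 1: keep the pose the pose-scoring function prefers.
        best_pose = min(poses, key=pose_score)
        # Round 2: estimate binding free energy for that pose only.
        ranked.append((name, best_pose, binding_energy(best_pose)))
    ranked.sort(key=lambda entry: entry[2])
    return ranked
```

Keeping the two scoring functions as separate arguments reflects the text's distinction between identifying the pose and ranking the ligands; the same function could in principle be passed for both.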

