Addendum: Resolving data bias improves generalization in binding affinity prediction
- Research Article
20
- 10.1021/ci400045v
- May 21, 2013
- Journal of Chemical Information and Modeling
In this study, we use the recently released 2012 Community Structure-Activity Resource (CSAR) data set to evaluate two knowledge-based scoring functions, ITScore and STScore, and a simple force-field-based potential (VDWScore). The CSAR data set contains 757 compounds, most with known affinities, and 57 crystal structures. With the help of the script files for docking preparation, we use the full CSAR data set to evaluate the performances of the scoring functions on binding affinity prediction and active/inactive compound discrimination. The CSAR subset that includes crystal structures is used as well, to evaluate the performances of the scoring functions on binding mode and affinity predictions. Within this structure subset, we investigate the importance of accurate ligand and protein conformational sampling and find that the binding affinity predictions are less sensitive to non-native ligand and protein conformations than the binding mode predictions. We also find the full CSAR data set to be more challenging in making binding mode predictions than the subset with structures. The script files used for preparing the CSAR data set for docking, including scripts for canonicalization of the ligand atoms, are offered freely to the academic community.
- Research Article
7
- 10.1002/prot.24366
- Sep 14, 2013
- Proteins: Structure, Function, and Bioinformatics
Predictions of protein-protein binders and binding affinities have traditionally focused on features pertaining to the native complexes. In developing a computational method for predicting protein-protein association rate constants, we introduced the concept of the transient complex after mapping the interaction energy surface. The transient complex is located at the outer boundary of the bound-state energy well, with near-native separation and relative orientation between the subunits but with most of the short-range native interactions not yet formed. We found that the width of the binding funnel and the electrostatic interaction energy of the transient complex are among the features predictive of binders and binding affinities. These ideas were very promising for the five affinity-related targets (T43-45, 55, and 56) of CAPRI rounds 20-27. For T43, we ranked the single crystallographic complex as number 1 and were one of only two groups that clearly identified that complex as a true binder; for T44, we ranked the only design with measurable binding affinity as number 4. For the nine docking targets, continuing our success in previous CAPRI rounds, we produced 10 medium-quality models for T47 and acceptable models for T48 and T49. We conclude that the interaction energy landscape, and the transient complex in particular, will complement existing features in leading to better prediction of binding affinities.
- Research Article
11
- 10.1007/s11030-008-9069-9
- Aug 1, 2007
- Molecular Diversity
We report a neural network modeling approach combined with a genetic algorithm for predicting the experimental binding affinity of a diverse set of chemicals to human Estrogen Receptor alpha and beta (ER-alpha and ER-beta). The counterpropagation artificial neural network is used as the modeling method. Structural features of ligands having the strongest influence on the binding affinities were investigated. The molecular descriptors were selected in a variable selection procedure based on the genetic algorithm (GA). The 3D descriptors of molecular structures were calculated for the minimal energy conformation of isolated ligands. All the optimized models were tested on an internal and an external set of compounds. The models served for classification and prediction of binding affinities. The optimized models were 100% correct in the classification part, where the active molecules were separated from the inactive ones. The best predictive model of active molecules, assessed with the internal test set, yielded a prediction error of RMS = 0.12, while the predictions for the external test set contain some outliers, which are ascribed to individual compounds falling outside the structural domain of our model. The influence of the receptor on the conformation of the ligands in the ligand-protein complex is described and discussed in the accompanying paper.
- Research Article
21
- 10.1016/j.csbj.2023.11.009
- Jan 1, 2023
- Computational and Structural Biotechnology Journal
Prediction of protein-ligand binding affinity with deep learning
- Research Article
16
- 10.1016/j.artmed.2010.05.003
- Jun 11, 2010
- Artificial Intelligence in Medicine
Quantitative prediction of MHC-II binding affinity using particle swarm optimization
- Research Article
1
- 10.1101/2023.11.16.567384
- Oct 21, 2024
- bioRxiv : the preprint server for biology
Accurate binding affinity prediction is crucial to structure-based drug design. Recent work used computational topology to obtain an effective representation of protein-ligand interactions. While algorithms using algebraic topology have proven useful in predicting properties of biomolecules, previous algorithms employed uninterpretable machine learning models which failed to explain the underlying geometric and topological features that drive accurate binding affinity prediction. Moreover, they had high computational complexity which made them intractable for large proteins. We present the fastest known algorithm to compute persistent homology features for protein-ligand complexes using opposition distance, with a runtime that is independent of the protein size. Then, we exploit these features in a novel, interpretable algorithm to predict protein-ligand binding affinity. Our algorithm achieves interpretability through an effective embedding of distances across bipartite matchings of the protein and ligand atoms into real-valued functions by summing Gaussians centered at features constructed by persistent homology. We name these functions internuclear persistent contours (IPCs). Next, we introduce persistence fingerprints, a vector with 10 components that sketches the distances of different bipartite matchings between protein and ligand atoms, refined from IPCs. Let the number of protein atoms in the protein-ligand complex be n, the number of ligand atoms be m, and ω ≈ 2.4 be the matrix multiplication exponent. We show that for any 0 < ε < 1, after an O(mn log(mn)) preprocessing procedure, we can compute an ε-accurate approximation to the persistence fingerprint in O(m log^{6ω}(m/ε)) time, independent of protein size. This is an improvement in time complexity by a factor of O((m + n)^3) over any previous binding affinity prediction that uses persistent homology.
We show that the representational power of the persistence fingerprint generalizes to protein-ligand binding datasets beyond the training dataset. Then, we introduce PATH (Predicting Affinity Through Homology), a two-part algorithm consisting of PATH+ and PATH-. PATH+ is an interpretable, small ensemble of shallow regression trees for binding affinity prediction from persistence fingerprints. We show that despite using 1,400-fold fewer features, PATH+ has comparable performance to a previous state-of-the-art binding affinity prediction algorithm that uses persistent homology. Moreover, PATH+ has the advantage of being interpretable. We visualize the features captured by the persistence fingerprint for variant HIV-1 protease complexes and show that the persistence fingerprint captures binding-relevant structural mutations. PATH-, in turn, uses regression trees over IPCs to differentiate between binding and decoy complexes. Finally, we benchmarked PATH against established binding affinity prediction algorithms spanning physics-based, knowledge-based, and deep learning methods, revealing that PATH has comparable or better performance with less overfitting compared to these state-of-the-art methods. The source code for PATH is released open-source as part of the osprey protein design software package.
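The "summing Gaussians centered at features" idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' IPC construction: the function names, the bandwidth `sigma`, and the example feature distances are all hypothetical, and the persistent homology step that would produce the features is omitted.

```python
import math


def gaussian_sum_contour(centers, sigma=0.5):
    """Return f(x) = sum_i exp(-(x - c_i)^2 / (2 * sigma^2)).

    `centers` stands in for distance-like features; in the paper these
    come from persistent homology, which this sketch does not compute.
    """
    def contour(x):
        return sum(
            math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2)) for c in centers
        )
    return contour


def sample_contour(contour, start, stop, num_points):
    """Evaluate the contour on an evenly spaced grid (endpoints included)."""
    step = (stop - start) / (num_points - 1)
    return [contour(start + i * step) for i in range(num_points)]


# Hypothetical feature distances (in angstroms), for illustration only.
feature_distances = [2.0, 2.1, 4.5]
contour = gaussian_sum_contour(feature_distances, sigma=0.5)
profile = sample_contour(contour, 0.0, 6.0, 13)  # grid spacing of 0.5
```

Sampling the function on a fixed grid turns a variable-length set of features into a fixed-length vector, which is one common reason for using this kind of embedding before a regression model.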
- Research Article
12
- 10.1016/j.ijbiomac.2024.129490
- Jan 13, 2024
- International Journal of Biological Macromolecules
PRA-Pred: Structure-based prediction of protein-RNA binding affinity
- Research Article
48
- 10.1186/s12859-016-1169-4
- Sep 1, 2016
- BMC Bioinformatics
Background: Pose generation error is usually quantified as the difference between the geometry of the pose generated by the docking software and that of the same molecule co-crystallised with the considered protein. Surprisingly, the impact of this error on binding affinity prediction is yet to be systematically analysed across diverse protein-ligand complexes. Results: Against commonly held views, we have found that pose generation error generally has a small impact on the accuracy of binding affinity prediction. This is also true for large pose generation errors, and it is observed not only with machine-learning scoring functions but also with classical scoring functions such as AutoDock Vina. Furthermore, we propose a procedure to correct a substantial part of this error, which consists of calibrating the scoring functions with re-docked, rather than co-crystallised, poses. In this way, the relationship between Vina-generated protein-ligand poses and their binding affinities is directly learned. As a result, test set performance after this error-correcting procedure is much closer to that of predicting the binding affinity in the absence of pose generation error (i.e. on crystal structures). We evaluated several strategies, obtaining better results for those using a single docked pose per ligand than for those using multiple docked poses per ligand. Conclusions: Binding affinity prediction is often carried out on the docked pose of a known binder rather than its co-crystallised pose. Our results suggest that pose generation error is in general far less damaging for binding affinity prediction than is currently believed. Another contribution of our study is the proposal of a procedure that largely corrects for this error.
The resulting machine-learning scoring function is freely available at http://istar.cse.cuhk.edu.hk/rf-score-4.tgz and http://ballester.marseille.inserm.fr/rf-score-4.tgz. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1169-4) contains supplementary material, which is available to authorized users.
- Research Article
4
- 10.1002/prot.26827
- Apr 8, 2025
- Proteins
Predicting the structure of ligands bound to proteins is a foundational problem in modern biotechnology and drug discovery, yet little is known about how to combine the predictions of protein-ligand structure (poses) produced by the latest deep learning methods to identify the best poses and how to accurately estimate the binding affinity between a protein target and a list of ligand candidates. Further, a blind benchmarking and assessment of protein-ligand structure and binding affinity prediction is necessary to ensure it generalizes well to new settings. Towards this end, we introduce MULTICOM_ligand, a deep learning-based protein-ligand structure and binding affinity prediction ensemble featuring structural consensus ranking for unsupervised pose ranking and a new deep generative flow matching model for joint structure and binding affinity prediction. Notably, MULTICOM_ligand ranked among the top-5 ligand prediction methods in both protein-ligand structure prediction and binding affinity prediction in the 16th Critical Assessment of Techniques for Structure Prediction (CASP16), demonstrating its efficacy and utility for real-world drug discovery efforts. The source code for MULTICOM_ligand is freely available on GitHub.
- Research Article
67
- 10.1002/bip.21091
- Jan 1, 2008
- Peptide Science
In this article, we discuss the application of the Gaussian process (GP) and other statistical methods (PLS, ANN, and SVM) for the modeling and prediction of binding affinities between the human amphiphysin SH3 domain and its peptide ligands. Divided physicochemical property scores of amino acids, involving significant hydrogen bond, electronic, hydrophobic, and steric properties, were used to characterize the peptide structures, and quantitative structure-affinity relationship models were then constructed by PLS, ANN, SVM, and GP coupled with genetic algorithm-variable selection. The results show that: (i) because of the significant flexibility and high complexity of polypeptide structures, the linear PLS method could not perform satisfactorily on the SH3 domain binding peptide dataset; (ii) overfitting during training decreased the predictive power of the ANN model to some extent; (iii) both SVM and GP perform well on the SH3 domain binding peptide dataset. Moreover, by combining linear and nonlinear terms in the covariance function, the GP is capable of handling hybrid linear and nonlinear relationships, and thus yields a more stable and predictive model than SVM. Analyses of the GP models showed that diverse properties contribute markedly to the interactions between the SH3 domain and the peptides. In particular, the steric property and hydrophobicity of P(2), the electronic property of P(0), and the electronic and hydrogen bond properties of P(-3) in the decapeptide (P(4)P(3)P(2)P(1)P(0)P(-1)P(-2)P(-3)P(-4)P(-5)) contribute significantly to the binding affinities of SH3 domain-peptide interactions.
- Research Article
19
- 10.1007/s10822-011-9529-7
- Dec 25, 2011
- Journal of Computer-Aided Molecular Design
We carried out a prospective evaluation of the utility of the SIE (solvation interaction energy) scoring function for virtual screening and binding affinity prediction. Since experimental structures of the complexes were not provided, this was an exercise in virtual docking as well. We used our exhaustive docking program, Wilma, to provide high-quality poses that were rescored using SIE to provide binding affinity predictions. We also tested the combination of SIE with our latest solvation model, first shell of hydration (FiSH), which captures some of the discrete properties of water within a continuum model. We achieved good enrichment in virtual screening of fragments against trypsin, with an area under the receiver operating characteristic curve of about 0.7. Moreover, the early enrichment performance was quite good, with 50% of true actives recovered at a 15% false positive rate in a prospective calculation and at a 3% false positive rate in a retrospective application of SIE with FiSH. Binding affinity predictions for both trypsin and host-guest complexes were generally within 2 kcal/mol of the experimental values. However, the rank ordering of affinities differing by 2 kcal/mol or less was not well predicted. On the other hand, it was encouraging that the incorporation of a more sophisticated solvation model into SIE resulted in better discrimination of true binders from non-binders. This suggests that the inclusion of proper physics in our models is a fruitful strategy for improving the reliability of our binding affinity predictions.
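For readers unfamiliar with the enrichment metric quoted above, the area under the ROC curve can be computed as the probability that a randomly chosen active outranks a randomly chosen inactive. The sketch below shows that pairwise (Mann-Whitney) formulation; it is a generic illustration, not the evaluation code used in the study, and it assumes that a higher score means "more likely active" (energy-like scores, where lower is better, would need to be negated first).

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted scores, higher meaning "more likely active"
            (an assumption of this sketch).
    labels: 1 for active compounds, 0 for inactive ones.
    Ties between an active and an inactive count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one active and one inactive")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: two actives, two inactives (values are made up).
example_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the value of about 0.7 reported above indicates moderate enrichment.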
- Research Article
2
- 10.1186/s12859-022-05107-w
- Dec 16, 2022
- BMC Bioinformatics
Background: Compound–protein interaction site and binding affinity predictions are crucial for drug discovery and drug design. In recent years, many deep learning-based methods have been proposed for predictions related to compound–protein interaction. For protein inputs, how primary sequence and tertiary structure information are used affects prediction results. Results: In this study, we propose a deep learning model based on a multi-objective neural network for compound–protein interaction site and binding affinity prediction. We used several kinds of self-supervised protein embeddings to enrich our protein inputs and used convolutional neural networks to extract features from them. Our results demonstrate that our model improved on previous models in both interaction site prediction and affinity prediction. In a case study, our model could better predict binding sites, which also showed its effectiveness. Conclusion: These results suggest that our model could be a helpful tool for compound–protein related predictions.
- Preprint Article
- 10.21203/rs.3.rs-3675013/v2
- Jan 17, 2025
Binding affinity prediction is pivotal in drug design, offering insights into the interactions between ligands and protein targets and thereby significantly influencing the drug development pipeline. Its potential to expedite the identification of drug candidates has led to extensive research focused on developing machine learning algorithms for predicting binding affinity. However, most developments have concentrated on independently and identically distributed (i.i.d.) data. In real-world scenarios, prediction models may encounter novel chemical substructures, protein families absent from the training set, variations in experimental conditions, and evolving drug resistance mechanisms. These factors can lead to a significant degradation in performance, causing models to suggest suboptimal compounds or overlook promising candidates. These challenges are commonly referred to as Out-of-Domain (OOD) in the machine learning community. To address the OOD challenges in binding affinity algorithm development, several benchmarks have been introduced. However, we observe that many lack a convenient codebase framework for swift algorithm evaluation. In this paper, building upon the DrugOOD dataset, we introduce a comprehensive benchmarking framework to assess the resilience and adaptability of OOD algorithms in binding affinity prediction. Our framework offers a streamlined approach for evaluating algorithmic performance in OOD scenarios. Furthermore, we propose a method that surpasses existing state-of-the-art approaches in our benchmark tests. We anticipate that our contributions will spur further research addressing OOD challenges and enhance the reliability and robustness of binding affinity predictions in drug design. Code available at: https://github.com/zehanzz/BioFrontierOOD.git
- Preprint Article
4
- 10.26434/chemrxiv.9866912.v1
- Sep 23, 2019
Introduction: The ability to discriminate among ligands binding to the same protein target in terms of their relative binding affinity lies at the heart of structure-based drug design. Any improvement in the accuracy and reliability of binding affinity prediction methods decreases the discrepancy between experimental and computational results. Objectives: The primary objectives were to find the most relevant features affecting binding affinity prediction, to minimize manual feature engineering, and to improve the reliability of binding affinity prediction using efficient deep learning models by tuning the model hyperparameters. Methods: The binding site of each target protein was represented as a grid box around its bound ligand. Both binary and distance-dependent occupancies were examined for how an atom affects its neighbor voxels in this grid. A combination of different features, including ANOLEA, ligand elements, and Arpeggio atom types, was used to represent the input. An efficient convolutional neural network (CNN) architecture, DeepAtom, was developed, trained and tested on the PDBbind v2016 dataset. Additionally, an extended benchmark dataset was compiled to train and evaluate the models. Results: The best DeepAtom model showed improved accuracy in binding affinity prediction on the PDBbind core subset (Pearson’s R=0.83) and outperforms recent state-of-the-art models in this field. In addition, when the DeepAtom model was trained on our proposed benchmark dataset, it yielded a higher correlation than the baseline, which confirms the value of our model. Conclusions: The promising results for the predicted binding affinities are expected to pave the way for embedding deep learning models in virtual screening and rational drug design.
- Research Article
98
- 10.3390/ijms21228424
- Nov 10, 2020
- International journal of molecular sciences
Accurate prediction of the binding affinity of a protein-ligand complex is essential for efficient and successful rational drug design. Therefore, many binding affinity prediction methods have been developed. In recent years, as deep learning technology has become more powerful, it has also been applied to affinity prediction. In this work, a new neural network model that predicts the binding affinity of a protein-ligand complex structure is developed. Our model predicts the binding affinity of a complex using an ensemble of multiple independently trained networks that consist of multiple channels of 3-D convolutional neural network layers. Our model was trained using the 3772 protein-ligand complexes from the refined set of the PDBbind-2016 database and tested using the core set of 285 complexes. The benchmark results show that the Pearson correlation coefficient between the binding affinities predicted by our model and the experimental data is 0.827, which is higher than that of state-of-the-art binding affinity prediction scoring functions. Additionally, our method ranks the relative binding affinities of multiple possible binders of a protein quite accurately, comparably to the other scoring functions. Last, we measured which structural information is critical for predicting binding affinity and found that the complementarity between the protein and ligand is most important.
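Several of the abstracts in this list report a Pearson correlation coefficient between predicted and experimental affinities. As a reminder of what that number measures, here is a minimal, dependency-free sketch of the standard formula; the affinity values in the usage lines are invented for illustration and are not taken from any of the studies.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up predicted vs. experimental affinities (e.g. pKd units).
predicted = [5.1, 6.3, 7.0, 8.2]
experimental = [4.8, 6.5, 6.9, 8.6]
r = pearson_r(predicted, experimental)
```

Because Pearson's r is insensitive to a constant offset or linear rescaling, a high value shows that predictions track the experimental trend, not that their absolute errors are small.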