Abstract

Drug discovery is time-consuming and expensive, averaging over 10 years and $985 million per drug. Calculating the binding affinity between a target protein and a ligand through Virtual Screening is critical for discovering viable drugs. Although supervised machine learning (ML) can predict binding affinity accurately, models overfit severely because they fail to identify informative properties of protein-ligand complexes. This study used unsupervised ML to reveal underlying protein-ligand characteristics that strongly influence binding affinity. Protein-ligand 3D models were collected from the PDBBind database and vectorized into 2422 features per complex. Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), K-Means Clustering, and heatmaps were used to identify groups of complexes and the features responsible for the separation. ML benchmarking was used to determine the features’ effect on ML performance. The PCA heatmap revealed groups of complexes with binding affinity of pKd < 6 and pKd > 8 and identified the numbers of CCCH and CCCCCH fragments in the ligand as the features most responsible for the separation. A high correlation of 0.8337, the two features’ ability to explain 18% of the variance in binding affinity, and an error increase of 0.09 in Decision Trees trained without them suggest that the fragments exist within a larger ligand substructure that significantly influences binding affinity. This discovery provides a baseline for generating informative ligand representations so that ML models overfit less and can more reliably identify novel drug candidates. Future work will focus on validating the ligand substructure’s presence and discovering more informative intra-ligand relationships.
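
The pKd thresholds quoted in the abstract are easier to interpret once converted back to dissociation constants. The sketch below assumes the standard definition pKd = -log10(Kd), with Kd in mol/L, so pKd 6 corresponds to a Kd of 1 micromolar and pKd 8 to 10 nanomolar. The function names and the group labels "weak", "intermediate", and "strong" are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import math


def pkd_from_kd(kd_molar: float) -> float:
    """Convert a dissociation constant in mol/L to pKd = -log10(Kd)."""
    if kd_molar <= 0:
        raise ValueError("Kd must be a positive concentration in mol/L")
    return -math.log10(kd_molar)


def kd_from_pkd(pkd: float) -> float:
    """Invert the definition: Kd = 10 ** (-pKd), in mol/L."""
    return 10.0 ** (-pkd)


def affinity_group(pkd: float, low: float = 6.0, high: float = 8.0) -> str:
    """Bin a complex by pKd using the two thresholds quoted in the abstract.

    The labels are illustrative assumptions, not the authors' terminology.
    """
    if pkd < low:
        return "weak"          # pKd < 6, i.e. Kd above 1 micromolar
    if pkd > high:
        return "strong"        # pKd > 8, i.e. Kd below 10 nanomolar
    return "intermediate"      # complexes between the two groups


if __name__ == "__main__":
    # Hypothetical Kd values in mol/L, used only to exercise the functions.
    for kd in (5e-5, 2e-7, 3e-9):
        pkd = pkd_from_kd(kd)
        print(f"Kd = {kd:.1e} M, pKd = {pkd:.2f}, group = {affinity_group(pkd)}")
```

Because the scale is logarithmic, the gap between the two groups (pKd below 6 versus above 8) spans at least a hundredfold difference in Kd.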

Highlights

  • Drug discovery is the basis of the modern pharmaceutical market and encompasses most of the industry’s research and development funding [1]

  • Together with the relationship elucidated in this work, further interactions can be gathered into a corpus of ligand fragment relationships that influence binding affinity

  • Most importantly, uncovering specific ligand relationships will result in machine learning (ML) models that overfit less, making them more generalizable to new datasets and reliable for analyzing novel drug candidates [37,38,39]

Introduction

Drug discovery is the basis of the modern pharmaceutical market and encompasses most of the industry’s research and development funding [1]. On average, it takes 12-15 years and $985 million to deliver a drug to market, demonstrating the exhaustive time and effort required to complete the drug discovery process [2, 3]. Drug-Target Interaction (DTI) analysis is one of the most critical parts of drug discovery. It involves calculating the binding affinity between a target protein and a ligand molecule so that appropriate ligand candidates for drugs can be chosen. These candidates then proceed to in vitro experimentation, where lead compounds for the final drug are identified. Binding affinity between a protein and a ligand can be calculated through Virtual Screening (VS), shown in Fig. 2, in which compounds are screened and their binding affinity is calculated using molecular simulation software [5].
