Abstract

In silico protein–ligand binding prediction is an active area of research in computational chemistry and machine-learning-based drug discovery, because an accurate predictive model could greatly reduce the time and resources needed to detect and prioritize possible drug candidates. Proteochemometric modeling (PCM) attempts to build an accurate model of the protein–ligand interaction space by combining explicit protein and ligand descriptors. This requires information-rich, uniform and computer-interpretable representations of proteins and ligands. Previous PCM studies rely on pre-defined, handcrafted feature extraction methods, and many use protein descriptors that require alignment or are otherwise specific to a particular group of related proteins. However, recent advances in representation learning have shown that unsupervised machine learning can generate embeddings that outperform complex, human-engineered representations. Several embedding methods for proteins and molecules have been developed based on various language-modeling methods. Here, we demonstrate the utility of these unsupervised representations and compare three protein embeddings and two compound embeddings in a fair manner. We evaluate performance on various splits of a benchmark dataset, as well as on an internal dataset of protein–ligand binding activities, and find that unsupervised-learned representations significantly outperform handcrafted representations.

Highlights

  • A main goal of cheminformatics in the area of drug discovery is to model the interaction of small molecules with proteins in silico

  • By comparing the performance improvement of the full model over the No-Interaction-Terms model between splits, we find that the full model improves performance over the No-Interaction-Terms model by, on average, 10% on the random split (p = 3.1 × 10⁻¹⁰), 15% on the compound-cluster-out split (p = 1.7 × 10⁻¹¹) and 7% on the protein-out split (p = 1.8 × 10⁻⁸)

  • The results show that unsupervised-learned descriptors offer significant improvements over handcrafted descriptors in proteochemometric modeling (PCM)


Summary

Introduction

A main goal of cheminformatics in the area of drug discovery is to model the interaction of small molecules with proteins in silico. A common approach is to train a machine learning algorithm to predict the binding affinity of ligands towards a certain biological target, using a training set of compounds that have been experimentally measured on this target. This modality is commonly referred to as a quantitative structure–activity relationship (QSAR) model and uses the similarities and differences between molecules, represented in various ways, to learn patterns about their properties [2]. In multi-task modeling, a single model is trained to predict binding across multiple proteins simultaneously, allowing the model to take advantage of the correlations in binding activity between compounds on different targets [5,6]. In this setting, multiple outputs are predicted for a given compound input [7,8].
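The difference between these modeling setups is largely a difference in how the training data are laid out. The following is a minimal sketch, not the authors' pipeline: the compound and protein identifiers, descriptor vectors and activity values are all made up for illustration, and forming the PCM input by simple concatenation of a protein and a compound descriptor is an assumption about one common way of combining them.

```python
# Toy measurements: (compound_id, protein_id) -> activity. All values are invented.
measurements = {
    ("c1", "pA"): 6.2,
    ("c1", "pB"): 5.1,
    ("c2", "pA"): 7.4,
    ("c3", "pB"): 4.8,
}

# Hypothetical descriptor vectors (in practice: fingerprints, embeddings, ...).
compound_desc = {"c1": [0.1, 0.9], "c2": [0.7, 0.3], "c3": [0.5, 0.5]}
protein_desc = {"pA": [1.0, 0.0, 0.2], "pB": [0.0, 1.0, 0.4]}


def single_target_rows(target):
    """QSAR layout: one model per target; the input is the compound descriptor only."""
    return [
        (compound_desc[c], y)
        for (c, p), y in sorted(measurements.items())
        if p == target
    ]


def multi_task_rows(targets):
    """Multi-task layout: one row per compound, one output slot per target.

    Unmeasured compound-target pairs are marked with None.
    """
    return [
        (compound_desc[c], [measurements.get((c, p)) for p in targets])
        for c in sorted(compound_desc)
    ]


def pcm_rows():
    """PCM layout: one row per measured pair; the input is the protein
    descriptor concatenated with the compound descriptor (an assumption)."""
    return [
        (protein_desc[p] + compound_desc[c], y)
        for (c, p), y in sorted(measurements.items())
    ]


if __name__ == "__main__":
    print(single_target_rows("pA"))
    print(multi_task_rows(["pA", "pB"]))
    print(pcm_rows())
```

In the single-target and multi-task layouts the protein appears only implicitly, as the choice of model or output slot. In the PCM layout it is an explicit part of the input, which is what allows a model to be queried for a protein that has no output slot of its own.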

