Improving detection of protein-ligand binding sites with 3D segmentation

Marta M Stepniewska-Dziubinska,Piotr Zielenkiewicz,Pawel Siedlecki

doi:10.1038/s41598-020-61860-z

Open Access

Abstract

In recent years machine learning (ML) has taken the bio- and cheminformatics fields by storm, providing new solutions for a vast repertoire of problems related to the analysis of protein sequence, structure, and interactions. ML techniques, deep neural networks especially, have proven more effective than classical models for tasks such as predicting binding affinity for a molecular complex. In this work we investigated an earlier stage of the drug discovery process: finding druggable pockets on the protein surface that can later be used to design active molecules. For this purpose we developed a 3D fully convolutional neural network capable of binding site segmentation. Our solution has high prediction accuracy and provides intuitive representations of the results, which makes it easy to incorporate into drug discovery projects. The model’s source code, together with scripts for the most common use cases, is freely available at http://gitlab.com/cheminfIBB/kalasanty.

Highlights

The aim of rational drug design is to discover new drugs faster and cheaper.
We show results for the model trained on the whole training set and evaluated on the test set, and compare it to another deep learning (DL)-based approach, DeepSite.
In this work we presented Kalasanty, a neural network model for detecting binding cavities on protein surfaces.

Summary

Introduction

The aim of rational drug design is to discover new drugs faster and cheaper. Much of the effort is put into improving docking and scoring methodologies. The reverse approach is used in P2RANK[14], which uses a random forest (RF) model to predict a “ligandibility” score for each point on a protein’s surface and then clusters the points with high scores. This tool is an example of applying machine learning (ML) to detect pockets: supervised ML to score surface points and unsupervised ML to post-process these predictions. The input data are relatively readily available (in the case of P2RANK, the structure of a protein), but the desired information is typically much harder to acquire (e.g. the location of binding sites). Another axis of classification of ML models is based on their complexity, or depth. In the context of bio- and cheminformatics, DL makes it possible to predict in silico properties that require much effort to establish experimentally, such as detecting functional motifs in sequences[15] or assessing binding affinity for protein-ligand complexes[16,17].
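The two-stage idea described above (score surface points, then cluster the high-scoring ones) can be illustrated with a minimal sketch. This is not P2RANK's implementation: the function name `cluster_high_scoring_points`, the score threshold of 0.5, the 3.0 Å linking cutoff, and the single-linkage clustering rule are all assumptions chosen for illustration, and the per-point scores are taken as given rather than produced by a trained random forest.

```python
from math import dist


def cluster_high_scoring_points(points, scores, score_threshold=0.5, link_cutoff=3.0):
    """Group high-scoring surface points into candidate pockets.

    points: list of (x, y, z) coordinates, assumed to be in angstroms.
    scores: per-point scores in [0, 1], assumed to come from some
            supervised model (hypothetical input, not computed here).
    Returns a list of clusters (lists of point indices), ranked by the
    sum of the scores of their members, highest first.
    """
    if len(points) != len(scores):
        raise ValueError("points and scores must have the same length")

    # Post-processing step 1: keep only points predicted as likely binding.
    kept = [i for i, s in enumerate(scores) if s >= score_threshold]
    unvisited = set(kept)
    clusters = []

    # Post-processing step 2: single-linkage clustering by flood fill.
    # Two kept points end up in the same cluster if they are connected by
    # a chain of points whose consecutive distances are <= link_cutoff.
    for start in kept:
        if start not in unvisited:
            continue
        unvisited.remove(start)
        stack = [start]
        cluster = []
        while stack:
            i = stack.pop()
            cluster.append(i)
            neighbours = [j for j in unvisited
                          if dist(points[i], points[j]) <= link_cutoff]
            for j in neighbours:
                unvisited.remove(j)
                stack.append(j)
        clusters.append(sorted(cluster))

    # Rank candidate pockets by total score (an illustrative choice).
    clusters.sort(key=lambda c: sum(scores[i] for i in c), reverse=True)
    return clusters


# Toy data: two groups of nearby high-scoring points and one low-scoring point.
points = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (5, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.6, 0.9, 0.1]
pockets = cluster_high_scoring_points(points, scores)
print(pockets)  # [[0, 1, 2], [3, 4]]
```

The flood fill compares every visited point with all remaining points, so it scales quadratically; for realistic surface samplings a spatial index would be a natural replacement, but the simple version keeps the supervised and unsupervised stages easy to tell apart.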

Methods

Results

Discussion

Conclusion