Abstract

Despite abundant computational resources and the huge amount of available data on compound–protein interactions (CPIs), constructing appropriate datasets for training and evaluating CPI prediction models is not always easy. In this study, we developed a web server that facilitates the development and evaluation of prediction models by providing a dataset appropriate to the task. The web server helps model developers and evaluators obtain a suitable dataset covering both proteins and compounds, together with the attributes necessary for deep learning. Through the web interface, users can customize a CPI dataset derived from ChEMBL by setting the thresholds that define positive and negative examples. To help users choose appropriate thresholds, the server also displays the distribution of activity values in the dataset as a histogram. Together, these functions support effective development and evaluation of models. Furthermore, users can prepare task-specific datasets by selecting a set of target proteins based on criteria such as Pfam families, ChEMBL’s classification, and sequence similarity. By facilitating access to appropriate datasets, our web server (https://binds.lifematics.work/) can therefore help improve the accuracy and efficiency of in silico screening and drug design using machine learning, including deep learning.
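The threshold-based labeling described above can be illustrated with a minimal sketch. It is not the web server's implementation. It assumes activity values on a pChEMBL-like scale, where a higher value means a more active compound. The record layout, the function names `label_interactions` and `activity_histogram`, and the threshold values 6.0 and 5.0 are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import floor


def label_interactions(records, pos_threshold, neg_threshold):
    """Label (compound, protein, activity) records as positive (1) or negative (0).

    Assumption: a higher activity value means a stronger interaction.
    Records whose activity lies strictly between the two thresholds are
    treated as ambiguous and dropped.
    """
    if neg_threshold > pos_threshold:
        raise ValueError("neg_threshold must not exceed pos_threshold")
    labeled = []
    for compound_id, protein_id, activity in records:
        if activity >= pos_threshold:
            labeled.append((compound_id, protein_id, 1))
        elif activity <= neg_threshold:
            labeled.append((compound_id, protein_id, 0))
    return labeled


def activity_histogram(records, bin_width=1.0):
    """Count records per activity bin; each key is the lower edge of a bin."""
    counts = Counter(
        floor(activity / bin_width) * bin_width for _, _, activity in records
    )
    return dict(sorted(counts.items()))


# Hypothetical example records: (compound ID, protein ID, activity value).
example_records = [
    ("C1", "P1", 7.2),
    ("C2", "P1", 4.1),
    ("C3", "P2", 5.5),
    ("C4", "P2", 6.0),
]

example_labels = label_interactions(
    example_records, pos_threshold=6.0, neg_threshold=5.0
)
example_bins = activity_histogram(example_records, bin_width=1.0)
```

Inspecting the binned counts before fixing the thresholds mirrors the role of the histogram display: a gap between the two thresholds discards borderline measurements, which trades dataset size for cleaner labels.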


Introduction

Identification of disease-causing proteins and compounds that act on those diseases is an important starting point in the drug discovery process (Hughes et al, 2011). Improving the efficiency of drug development using known CPI data is necessary because it can shorten time to market and reduce costs. Machine learning (ML) methods using CPI data are already regarded as an effective means for the hit-to-lead stage (Ghasemi et al, 2018; Ferreira and Andricopulo, 2019). AI prediction models have already been applied to various problems, and they are expected to further enhance the efficiency of drug discovery (Tsubaki et al, 2019; Beker et al, 2020; Kojima et al, 2020). In the field of ML-based CPI prediction research, several widely used benchmark datasets and development methods have been proposed (He et al, 2017; Wu et al, 2018; Rifaioglu et al, 2020, 2021).
