Abstract
Computational prediction of the interaction between drugs and targets is a standing challenge in the field of drug discovery. A number of rather accurate predictions were reported for various binary drug–target benchmark datasets. However, a notable drawback of a binary representation of interaction data is that missing endpoints for non-interacting drug–target pairs are not differentiated from inactive cases, and that predicted levels of activity depend on pre-defined binarization thresholds. In this paper, we present a method called SimBoost that predicts continuous (non-binary) values of binding affinities of compounds and proteins and thus incorporates the whole interaction spectrum from true negative to true positive interactions. Additionally, we propose a version of the method called SimBoostQuant which computes a prediction interval in order to assess the confidence of the predicted affinity, thus defining the Applicability Domain metrics explicitly. We evaluate SimBoost and SimBoostQuant on two established drug–target interaction benchmark datasets and one new dataset that we propose to use as a benchmark for read-across cheminformatics applications. We demonstrate that our methods outperform the previously reported models across the studied datasets.
Highlights
Finding a compound that selectively binds to a particular protein is a highly challenging and typically expensive procedure in the drug development process, where more than 90% of candidate compounds fail due to crossreactivity and/or toxicity issues
We introduce Matrix Factorization as it was used in the literature for binary drug–target interaction prediction and as it plays an important role in our proposed method
We propose a version of SimBoost, called SimBoostQuant, which computes the confidence of the prediction by using quantile regression to learn a prediction interval for a given drug–target pair as a measure of the confidence of the prediction
Summary
Finding a compound that selectively binds to a particular protein is a highly challenging and typically expensive procedure in the drug development process, where more than 90% of candidate compounds fail due to crossreactivity and/or toxicity issues. It is an important topic in drug research to gain knowledge about the interaction of compounds and target proteins through computational methods Such in silico approaches are capable of speeding up the experimental wet lab work by systematically prioritizing the most potent compounds and help predicting their potential side effects. The datasets commonly used for the training and evaluation of such machine learning-based prediction methods are the Enzymes, Ion Channels, Nuclear Receptor, and G Protein-Coupled Receptor datasets [3] These datasets contain binary labels Y(i,j) = 1 if drug–target pair (di, tj) is known to interact (as shown by wet lab experiments) and Y(i,j) = 0 if either (di, tj) is known to not interact or if the interaction of (di, tj) is unknown. The datasets tend to be biased towards drugs and targets that are considered to be more important or easier
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.