Abstract

Computational methods for predicting the macromolecular targets of drugs and drug-like compounds have evolved into a key technology in drug discovery. However, the established validation protocols leave several key questions regarding the performance and scope of methods unaddressed. For example, prediction success rates are commonly reported as averages over all compounds of a test set and do not consider the structural relationship between the individual test compounds and the training instances. To obtain a better understanding of the value of ligand-based methods for target prediction, we benchmarked a similarity-based method and a random forest-based machine learning approach (both employing 2D molecular fingerprints) under three testing scenarios: a standard testing scenario with external data, a standard time-split scenario, and a scenario designed to most closely resemble real-world conditions. In addition, we deconvoluted the results based on the distances of the individual test molecules from the training data. We found that, surprisingly, the similarity-based approach generally outperformed the machine learning approach in all testing scenarios, even in cases where queries were structurally clearly distinct from the instances in the training (or reference) data, and despite a much higher coverage of the known target space.

Highlights

  • Computational methods for predicting the macromolecular targets of small molecules have become increasingly relevant and popular in recent years due to (i) the shift from the “one-drug-one-target” paradigm to “polypharmacology” [1,2,3,4,5], (ii) the increasing availability of chemical and biological data [6,7,8] and (iii) advances in algorithms and hardware technology

  • By analyzing the performance of the approaches under three different scenarios and deconvoluting the results based on the distance of the test compounds from the training data, we obtained a robust and differentiated picture of the performance and reach of the approaches

  • Under the standard testing scenario with external data, the percentage of queries whose target was recovered among the top-5 of 1798 ranked positions was 88% for the similarity-based approach and 85% for the machine learning (ML) approach
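The top-5 success rate quoted in the highlight above is a top-k recovery rate. A minimal sketch of how such a rate is computed (the function name and the example ranks are illustrative, not taken from the study):

```python
def top_k_success_rate(rankings, k=5):
    """Fraction of queries whose annotated target appears among the
    top-k ranked targets. `rankings` maps each query to the 1-based
    rank of its known target in the prediction list."""
    hits = sum(1 for rank in rankings.values() if rank <= k)
    return hits / len(rankings)

# Hypothetical ranks for five queries: four targets ranked within the top 5
ranks = {"q1": 1, "q2": 3, "q3": 7, "q4": 5, "q5": 2}
print(top_k_success_rate(ranks))  # 0.8
```

With 1798 candidate targets, a top-5 cutoff corresponds to recovering the correct target within the best 0.3% of the ranked list.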


Introduction

Computational methods for predicting the macromolecular targets of small molecules have become increasingly relevant and popular in recent years due to (i) the shift from the “one-drug-one-target” paradigm to “polypharmacology” [1,2,3,4,5], (ii) the increasing availability of chemical and biological data [6,7,8], and (iii) advances in algorithms and hardware technology. Ligand-based methods range from straightforward similarity-based approaches [13,14,15,16,17,18,19,20,21] and linear regressions [22] to more complex machine learning (ML) models such as random forests [23,24,25], support vector machines [25,26,27], self-organizing maps [28], neural and deep neural networks [25,29,30,31,32,33,34], and network-based models [35,36,37,38]. They typically use large amounts of chemical information and measured bioactivity data [12] and, as a result, have a larger coverage of the target space than structure-based methods, which rely on 3D structures of macromolecules. To the best of our knowledge, the Similarity Ensemble Approach (SEA) method remains the only target prediction model that has undergone systematic experimental validation [40,41,42].
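The similarity-based approaches mentioned above can be illustrated with a minimal sketch: each target is scored by the maximum Tanimoto similarity between the query fingerprint and the reference ligands annotated for that target, and targets are ranked by this score. In practice the 2D fingerprints would be generated with a cheminformatics toolkit; here they are represented as plain Python sets of on-bits, and all names and data are illustrative assumptions:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def predict_targets(query_fp, reference, top_n=5):
    """Rank targets by the maximum Tanimoto similarity between the query
    and any reference ligand annotated with that target.
    `reference` maps target name -> list of ligand fingerprints (bit sets)."""
    scores = {
        target: max(tanimoto(query_fp, fp) for fp in ligands)
        for target, ligands in reference.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy reference data (hypothetical targets and fingerprints)
query = {1, 2, 3, 5}
reference = {"TargetA": [{1, 2, 3, 4}], "TargetB": [{7, 8, 9}]}
print(predict_targets(query, reference))
```

Scoring by the single nearest reference ligand per target is one common design choice; other variants aggregate over the k nearest neighbors or, as in SEA, compare the query against whole ligand ensembles with a statistical model.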
