Adding Stochastic Negative Examples into Machine Learning Improves Molecular Bioactivity Prediction.

Elena L Cáceres, Michael J Keiser, Nicholas C Mew

doi:10.1021/acs.jcim.0c00565

Open Access

Abstract

Multitask deep neural networks learn to predict ligand-target binding by example, yet public pharmacological data sets are sparse, imbalanced, and approximate. We constructed two hold-out benchmarks to approximate temporal and drug-screening test scenarios, whose characteristics differ from a random split of conventional training data sets. We developed a pharmacological data set augmentation procedure, Stochastic Negative Addition (SNA), which randomly assigns untested molecule-target pairs as transient negative examples during training. Under the SNA procedure, drug-screening benchmark performance increases from R² = 0.1926 ± 0.0186 to 0.4269 ± 0.0272 (a 122% increase). This gain was accompanied by a modest 13% decrease in temporal benchmark performance. The drug-screening gains from SNA were consistent across classification and regression tasks and outperformed y-randomized controls. Our results highlight where data and feature uncertainty may be problematic and how incorporating uncertainty into training improves predictions of drug-target relationships.
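The idea behind SNA can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `add_stochastic_negatives`, the `ratio` parameter, the placeholder `negative_value`, and the toy molecule and target identifiers are all hypothetical. The abstract says only that untested molecule-target pairs are randomly assigned as transient negatives during training.

```python
import random


def add_stochastic_negatives(known, molecules, targets, ratio=1.0,
                             negative_value=0.0, rng=None):
    """Return (augmented, sampled) for one round of negative addition.

    known          -- dict mapping (molecule, target) -> measured activity
    ratio          -- hypothetical knob: negatives to add per known example
    negative_value -- hypothetical placeholder label for an assumed non-binder
    """
    rng = rng or random.Random()
    # Enumerate untested pairs. Fine for a toy matrix; a real, very sparse
    # data set would need a sampling scheme that avoids full enumeration.
    untested = [(m, t) for m in molecules for t in targets
                if (m, t) not in known]
    n_wanted = min(round(ratio * len(known)), len(untested))
    sampled = {pair: negative_value for pair in rng.sample(untested, n_wanted)}
    augmented = dict(known)      # measured values are never overwritten
    augmented.update(sampled)
    return augmented, sampled


# Toy data: two measured pairs in a 3 x 2 molecule-target matrix.
known = {("mol_a", "target_1"): 7.2, ("mol_b", "target_2"): 5.1}
molecules = ["mol_a", "mol_b", "mol_c"]
targets = ["target_1", "target_2"]

rng = random.Random(0)
for epoch in range(3):
    # "Transient": negatives are redrawn every epoch and then discarded.
    batch, negatives = add_stochastic_negatives(known, molecules, targets,
                                                rng=rng)
    print(epoch, sorted(negatives))
```

Redrawing the negatives each epoch means that no single untested pair is permanently treated as inactive, which is one plausible reading of "transient" in the abstract.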

Highlights

Machine learning and deep neural network (DNN) methods have made great strides in scientific pattern recognition, including for cheminformatics[1,2,3,4,5,6,7].
We developed a machine learning training procedure to transiently add likely negative examples: unstudied pairs of small molecules and protein targets that we assert do not bind.
DNN models trained with five-fold cross validation using Stochastic Negative Addition (SNA) outperformed conventionally-trained models on the screening (Drug Matrix) benchmark (Figure 2(g,h); Table 1; Supplementary Table 1; Supplementary Figures 9-10, 17-18) with little effect on training or random test performance (Figure 2(a-d); Table 1; Supplementary

Summary

Introduction

Machine learning and deep neural network (DNN) methods have made great strides in scientific pattern recognition, including for cheminformatics[1,2,3,4,5,6,7]. As larger amounts of training data (molecules and their protein binding partners) have become publicly available, ligand-based predictions of polypharmacology have expanded from classification of binding (e.g., active/inactive) to regression of drug-target affinity scores (e.g., Ki, IC50)[3,4,8,9,10,11,12]. These models exploit the similar property principle of chemical informatics, which states that small molecules with similar structures are likely to exhibit similar biological properties, such as their binding to protein targets[13]. Such approaches assume that the principle holds true for large datasets and hinge on the expectation that a greater diversity of training examples will increase the likelihood of a model finding generalizable patterns relating chemical structure to bioactivity. We explore the feasibility of a method that leverages uncertainty in unexplored chemical space to augment incomplete public data for small molecule activity prediction using deep learning for both classification and regression.
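To make the similar property principle concrete, the sketch below scores structural similarity between two molecules represented as sets of "on" fingerprint bits. The text above does not name a similarity measure; the Tanimoto (Jaccard) coefficient used here is a common choice in cheminformatics and is an assumption, as are the toy bit sets and the convention of returning 0.0 for two empty fingerprints.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) coefficient between two sets of fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical fingerprints: indices of substructure bits that are set.
query = {1, 2, 3}
close_analog = {2, 3, 4}
unrelated = {7, 8, 9}

print(tanimoto(query, close_analog))
print(tanimoto(query, unrelated))
```

Under the principle, the pair with the higher score would be expected to share bioactivity more often, although the principle is a tendency rather than a guarantee.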
