Abstract

Here we present a streamlined, explainable graph convolutional neural network (gCNN) architecture for small molecule activity prediction. We first conduct a hyperparameter optimization across nearly 800 protein targets that produces a simplified gCNN QSAR architecture, and we observe that such a model can yield performance improvements over both standard gCNN and RF methods on difficult-to-classify test sets. Additionally, we discuss how reductions in convolutional layer dimensions potentially speak to the “anatomical” needs of gCNNs with respect to radial coarse graining of molecular substructure. We augment this simplified architecture with saliency map technology that highlights molecular substructures relevant to activity, and we perform saliency analysis on nearly 100 data-rich protein targets. We show that the resultant substructural clusters are useful visualization tools for understanding substructure-activity relationships. We go on to highlight connections between our models’ saliency predictions and observations made in the medicinal chemistry literature, focusing on four case studies of past lead finding and lead optimization campaigns.

Highlights

  • Machine-learning models of quantitative structure–activity relationships (QSAR models) are staples of drug discovery research and represent some of the longest established applications of artificial intelligence in any industrial field [1,2,3,4,5,6]

  • Breaking down the test set into early and late subsets (80%-90% and 90%-100% time slices) illustrates possible situations in which graph convolutional neural networks (gCNNs) might be advantageous over random forest (RF) and in which our optimized gCNN architecture might significantly improve on the default

  • While the default gCNN results are on the border of significance in terms of improvement over RF, our optimized gCNN clearly performs better than both RF and the default gCNN at 95% confidence on that late subset. These results suggest that, even for relatively small labeled QSAR datasets, gCNNs should perhaps be favored over more classical QSAR methods like RF when training and test sets are distinctly separated in time and/or molecular similarity
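The time-slice evaluation described above splits assay records chronologically rather than randomly, so the test set contains molecules registered after everything in the training set. A minimal sketch of such a split is shown below; `time_slice_split` and the tuple record layout are hypothetical illustrations, not the authors' actual pipeline, and the 80%/90% cut points mirror the slices named in the highlight.

```python
def time_slice_split(records, train_frac=0.8, mid_frac=0.9):
    """Chronological train / early-test / late-test split (illustrative only).

    records: iterable of (date, molecule_id, activity_label) tuples,
    where dates are mutually comparable (e.g., ISO strings or ints).
    Returns the first 80% (train), the 80%-90% slice (early test),
    and the 90%-100% slice (late test).
    """
    ordered = sorted(records, key=lambda r: r[0])  # oldest first
    n = len(ordered)
    a, b = int(n * train_frac), int(n * mid_frac)
    return ordered[:a], ordered[a:b], ordered[b:]

# Toy data: ten records registered on consecutive "dates"
data = [(day, f"mol{day}", day % 2) for day in range(10)]
train, early_test, late_test = time_slice_split(data)
```

Because every test molecule postdates the training data, this split probes the distribution shift under which the highlight reports gCNNs outperforming RF.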

Introduction

Machine-learning models of quantitative structure–activity relationships (QSAR models) are staples of drug discovery research and represent some of the longest established applications of artificial intelligence in any industrial field [1,2,3,4,5,6]. QSAR models apply some parametric function to relate a representation of a small molecule’s structure to an experimental measurement of a physical property, activity against a particular biomolecular target, or other observable [7]. The form of this function, which has free parameters fit to minimize deviations from experimental activity labels/values, can range from simple straight lines to logistic curves to “random forests” of decision trees [8] to complex arrangements of neurons distributed across several or even dozens of hidden functional layers [9]. Molecular representations can be simple predefined atom and substructure count-based fingerprints (e.g., PubChem fingerprints) [10], more general hashed radial fingerprints (e.g., ECFP4/ECFP6 fingerprints) [11], or even vector-based molecular representations that are fully learned through some artificial intelligence approach [12]. The “neural fingerprints” that result from encoding provide rich, multiscale vectorial representations of molecules that can, in turn, be fed into additional neural network layers that facilitate activity classification.
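The hashed radial fingerprints mentioned above (the ECFP family) work by hashing each atom's environment at increasing bond radii into bit positions of a fixed-length vector. The toy sketch below illustrates that idea using only the standard library; it is a deliberately simplified stand-in for a real implementation such as RDKit's Morgan fingerprints, and the `atoms`/`bonds` input format and `radial_fingerprint` name are assumptions for illustration (real ECFPs also encode bond orders, charges, and other atom invariants).

```python
import hashlib

def radial_fingerprint(atoms, bonds, radius=2, n_bits=64):
    """Toy ECFP-style hashed radial fingerprint (illustrative sketch).

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs
    Returns the set of "on" bit positions in an n_bits-wide vector.
    """
    # Build an adjacency list from the bond pairs
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius-0 identifier for each atom: just its element symbol
    ids = {i: atoms[i] for i in range(len(atoms))}
    on_bits = set()
    for _ in range(radius + 1):
        # Hash every atom-environment identifier into a bit position
        for ident in ids.values():
            h = int(hashlib.md5(ident.encode()).hexdigest(), 16)
            on_bits.add(h % n_bits)
        # Grow each environment by one bond: fold in sorted neighbor ids
        ids = {i: ids[i] + "".join(sorted(ids[j] for j in neighbors[i]))
               for i in ids}
    return on_bits

# Ethanol as a bare graph: C-C-O
bits = radial_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

The resulting bit vector can then serve as input features for any of the parametric functions named above, from logistic curves to random forests.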
