Abstract

Difficulties in interpreting machine learning (ML) models and their predictions limit the practical applicability of, and confidence in, ML in pharmaceutical research. There is a need for agnostic approaches that aid in the interpretation of ML models regardless of their complexity and that are also applicable to deep neural network (DNN) architectures and model ensembles. To these ends, the SHapley Additive exPlanations (SHAP) methodology has recently been introduced. The SHAP approach enables the identification and prioritization of features that determine compound classification and activity prediction using any ML model. Herein, we further extend the evaluation of the SHAP methodology by investigating a variant for exact calculation of Shapley values for decision tree methods and systematically compare this variant with the model-independent SHAP method in compound activity and potency value predictions. Moreover, new applications of the SHAP analysis approach are presented, including the interpretation of DNN models for the generation of multi-target activity profiles and of ensemble regression models for potency prediction.

Highlights

  • Major tasks for machine learning (ML) in chemoinformatics and medicinal chemistry include predicting new bioactive small molecules or the potency of active compounds [1,2,3,4]

  • The SHapley Additive exPlanations (SHAP) methodology enables the interpretation of ML models and their predictions, yielding feature importance values for individual predictions from any ML model

  • For models based on decision tree (DT) ensembles, the recently developed tree SHAP algorithm makes it possible to calculate exact Shapley values, which represents the most critical step in the derivation of an explanation model
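
The exact Shapley values referenced above can be illustrated with a brute-force sketch. The toy model `f`, the instance `x`, and the mean-substitution baseline for "absent" features are all hypothetical choices for illustration; tree SHAP computes the same quantity efficiently for DT ensembles, whereas the enumeration below scales exponentially in the number of features.

```python
# Hedged sketch: exact Shapley values by enumerating all feature coalitions.
# The model f and the baseline convention are illustrative assumptions, not
# the paper's setup; tree SHAP obtains the same values without enumeration.
from itertools import combinations
from math import factorial

def f(x):
    # Toy "model": linear terms plus an interaction between features 0 and 1.
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[1] - 1.0 * x[2]

def shapley_values(f, x, baseline):
    """phi_i = sum over coalitions S not containing i of
    |S|! (n-|S|-1)! / n! * (v(S u {i}) - v(S)),
    where v(S) evaluates f with features outside S set to the baseline."""
    n = len(x)

    def v(S):
        z = [x[j] if j in S else baseline[j] for j in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (v(set(S) | {i}) - v(set(S)))
    return phi

x = [1.0, 2.0, 3.0]          # instance to explain (hypothetical)
baseline = [0.0, 0.0, 0.0]   # reference values for "absent" features
phi = shapley_values(f, x, baseline)
# Efficiency property: the feature contributions sum to f(x) - f(baseline).
assert abs(sum(phi) - (f(x) - f(baseline))) < 1e-9
```

The efficiency property checked at the end (contributions summing to the difference between the prediction and the baseline prediction) is what makes Shapley values attractive for per-prediction feature attribution.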


Introduction

Major tasks for machine learning (ML) in chemoinformatics and medicinal chemistry include predicting new bioactive small molecules or the potency of active compounds [1,2,3,4]. Such predictions are carried out on the basis of molecular structure, using computational descriptors calculated from molecular graph representations or conformations. In structure–activity relationship (SAR) analysis, explainable model decisions help to identify chemical changes that correlate with dependent variables and result in defined activity states or potency values. Having access to such model-intrinsic information enables knowledge-based validation of models and hypothesis generation [9]. In addition to model accuracy, interpretability of predictions is a major criterion for the acceptance of computational approaches in pharmaceutical research.
