When stakes are high: Balancing accuracy and transparency with Model-Agnostic Interpretable Data-driven suRRogates

Roel Henckaerts, Katrien Antonio, Marie-Pier Côté

doi:10.1016/j.eswa.2022.117230

Open Access

https://doi.org/10.1016/j.eswa.2022.117230

Abstract

Technological advancements make it possible to develop high-performance black box predictive models. However, strictly regulated industries (like banking and insurance) require transparent decision-making algorithms. We therefore present a procedure to develop a Model-Agnostic Interpretable Data-driven suRRogate (maidrr) suited for structured tabular data. Knowledge is extracted from a black box via partial dependence effects. These are used to perform smart feature engineering by grouping variable values. This results in a segmentation of the feature space with automatic variable selection. A transparent generalized linear model (GLM) is fitted to the features in categorical format and their relevant interactions. This GLM serves as a global surrogate to the original black box and replaces it in production. We demonstrate our R package maidrr with a case study on general insurance claim frequency modeling for six publicly available datasets. Our maidrr GLM closely approximates a gradient boosting machine (GBM) black box and outperforms both a linear and a tree surrogate used as benchmarks.
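To make the first two steps of the procedure concrete, the following is a minimal Python sketch of computing a partial dependence effect and grouping feature values with similar effects. It is not the maidrr package or its API. The `black_box` function, the synthetic `age` and `power` features, and the adjacent-merge rule with tolerance `tol` are all assumptions made for illustration; the grouping algorithm used by the authors may differ, and the final GLM fit on the grouped features is omitted.

```python
import random


def black_box(age, power):
    # Hypothetical stand-in for a fitted black box (e.g. a GBM)
    # that predicts claim frequency from two features.
    base = 0.20 if age < 25 else 0.10 if age < 60 else 0.13
    return base * (1.0 + 0.02 * power)


def partial_dependence(model, data, feature, grid):
    """Average prediction when `feature` is forced to each grid value."""
    effects = []
    for value in grid:
        total = 0.0
        for row in data:
            modified = dict(row, **{feature: value})
            total += model(**modified)
        effects.append(total / len(data))
    return effects


def group_by_pd(grid, effects, tol):
    """Merge adjacent grid values whose effects differ by less than tol.

    This simple rule is an illustrative assumption, not the authors' method.
    """
    groups = [[grid[0]]]
    for prev, curr, value in zip(effects, effects[1:], grid[1:]):
        if abs(curr - prev) < tol:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


# Synthetic portfolio (assumed data, fixed seed for reproducibility).
rng = random.Random(0)
data = [{"age": rng.randint(18, 80), "power": rng.randint(4, 12)}
        for _ in range(200)]

age_grid = list(range(18, 81))
age_pd = partial_dependence(black_box, data, "age", age_grid)
age_groups = group_by_pd(age_grid, age_pd, tol=1e-3)

# Each group becomes one level of a categorical feature for the surrogate.
segments = [(g[0], g[-1]) for g in age_groups]
print(segments)  # [(18, 24), (25, 59), (60, 80)]
```

In this toy setting the grouping recovers the three age bands built into the stand-in black box. A surrogate GLM would then be fitted on such categorical features rather than on the raw values.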

Full Text