A comparison of local explanation methods for high-dimensional industrial data: A simulation study

Niklas Fries,Patrik Rydén

doi:10.1016/j.eswa.2022.117918

Abstract

Prediction methods can be augmented by local explanation methods (LEMs) to perform root cause analysis for individual observations. But while most recent research on LEMs focus on low-dimensional problems, real-world datasets commonly have hundreds or thousands of variables. Here, we investigate how LEMs perform for high-dimensional industrial applications. Seven prediction methods (penalized logistic regression, LASSO, gradient boosting, random forest and support vector machines) and three LEMs (TreeExplainer, Kernel SHAP, and the conditional normal sampling importance (CNSI)) were combined into twelve explanation approaches. These approaches were used to compute explanations for simulated data, and real-world industrial data with simulated responses. The approaches were ranked by how well they predicted the contributions according to the true models. For the simulation experiment, the generalized linear methods provided best explanations, while gradient boosting with either TreeExplainer or CNSI, or random forest with CNSI were robust for all relationships. For the real-world experiment, TreeExplainer performed similarly, while the explanations from CNSI were significantly worse. The generalized linear models were fastest, followed by TreeExplainer, while CNSI and Kernel SHAP required several orders of magnitude more computation time. In conclusion, local explanations can be computed for high-dimensional data, but the choice of statistical tools is crucial.

Full Text