Abstract

This paper reviews and advocates against the use of permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because they are both model-agnostic and depend only on the pre-trained model output, making them computationally efficient and widely available in software. However, numerous studies have found that these tools can produce diagnostics that are highly misleading, particularly when there is strong dependence among features. The purpose of our work here is to (i) review this growing body of literature, (ii) provide further demonstrations of these drawbacks along with a detailed explanation as to why they occur, and (iii) advocate for alternative measures that involve additional modeling. In particular, we describe how breaking dependencies between features in hold-out data places undue emphasis on sparse regions of the feature space by forcing the original model to extrapolate to regions where there is little to no data. We explore these effects across various model setups and find support for previous claims in the literature that PaP metrics can vastly over-emphasize correlated features in both variable importance measures and partial dependence plots. As an alternative, we discuss and recommend more direct approaches that involve measuring the change in model performance after muting the effects of the features under investigation.
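To make the object under discussion concrete, a permute-and-predict (PaP) importance score can be sketched as follows. This is a minimal numpy illustration of the generic idea; the function and variable names are ours, not from any particular library or from the paper itself:

```python
import numpy as np

def pap_importance(predict, X, y, col, n_repeats=20, seed=0):
    """Permute-and-predict (PaP) importance of feature `col`:
    the mean increase in held-out MSE after shuffling that column,
    scoring the *original* fitted model on the permuted data."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    increases = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, col] = rng.permutation(Xp[:, col])  # break the feature's ties to y and to other features
        increases.append(np.mean((y - predict(Xp)) ** 2) - base_mse)
    return float(np.mean(increases))

# Toy check: y depends strongly on x0, weakly on x1.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 2))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.1 * rng.standard_normal(2000)
predict = lambda X: 3.0 * X[:, 0] + 1.0 * X[:, 1]  # stands in for a fitted model
imp0 = pap_importance(predict, X, y, 0)
imp1 = pap_importance(predict, X, y, 1)
```

Note that only the held-out data is permuted; the model is never refit. The alternative the paper recommends instead measures the change in performance after muting or removing the feature and refitting, so the model is never scored on permuted inputs that lie outside the region where it saw data.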

Highlights

  • Machine learning methods have proved to be enormously successful tools for making predictions from data

  • As we demonstrate, permutation-based methods place significant weight on exactly these extrapolated predictions

  • We find that permutation importances do recover importance orderings reliably when linear models are used as estimates in our simulations



Introduction

Machine learning methods have proved to be enormously successful tools for making predictions from data. Consider, for example, a linear model

$f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j,$

where, if each covariate $x_j$ has variance 1, its permutation importance is given by $\beta_j^2$, regardless of the correlation among the features. While this is by no means the only way to define importance for a linear model, it does correspond to the familiar incantation of "the change in y for one unit change in $x_j$, keeping all else fixed" and could be construed as justifying permute-and-predict measures. As such, these measures are most natural when we are interested in the statistical properties of a machine learning method, as opposed to conducting a "model audit" in which an estimated model is considered fixed and we merely wish to summarize its behavior. This is similar to the distinction in Fisher et al. (2019) between model reliance and model class reliance, which provides a generalized notion of a confidence interval for variable importance. Our goals in this paper are to (i) review the growing body of literature on this topic, (ii) provide an extended, more detailed examination of the behavior of these diagnostic tools along with an explanation for that behavior, and (iii) advocate for alternative measures that involve additional modeling.
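The linear-model claim is easy to check numerically. The following is a small simulation of our own (setup and names are ours, not the paper's): for an exactly linear model with unit-variance covariates, the PaP importance comes out proportional to $\beta_j^2$ (with a constant factor of 2 here, since permutation compares $x_j$ against an effectively independent copy of itself), whatever the correlation between the features.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.9
# Correlated unit-variance covariates with corr(x0, x1) = rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=n)

beta = np.array([2.0, 1.0])
predict = lambda X: X @ beta   # an exactly linear model (beta_0 = 0)
y = predict(X)                 # noiseless, so the baseline MSE is zero

def pap(col):
    """MSE of the original model after permuting one column."""
    Xp = X.copy()
    Xp[:, col] = rng.permutation(Xp[:, col])
    return np.mean((y - predict(Xp)) ** 2)

imp = np.array([pap(0), pap(1)])
ratio = imp / beta**2   # ~2 * Var(x_j) = 2 for both features, despite rho = 0.9
```

The ratio is the same for both features even at correlation 0.9, matching the claim that for linear models the PaP importance depends only on $\beta_j^2$ and not on the dependence structure; the paper's point is that this convenient behavior does not carry over to flexible models, which can extrapolate arbitrarily badly on the permuted points.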

A simple simulated example
Simulation results
Extrapolation and explanations
Variable importance alternatives
Ordering versus testing: a further word of caution
Real world data: bike sharing
A Proofs of results
Findings
