Abstract

We propose the conditional predictive impact (CPI), a consistent and unbiased estimator of the association between one or several features and a given outcome, conditional on a reduced feature set. Building on the knockoff framework of Candès et al. (J R Stat Soc Ser B 80:551–577, 2018), we develop a novel testing procedure that works in conjunction with any valid knockoff sampler, supervised learning algorithm, and loss function. The CPI can be efficiently computed for high-dimensional data without any sparsity constraints. We demonstrate convergence criteria for the CPI and develop statistical inference procedures for evaluating its magnitude, significance, and precision. These tests aid in feature and model selection, extending traditional frequentist and Bayesian techniques to general supervised learning tasks. The CPI may also be applied in causal discovery to identify underlying multivariate graph structures. We test our method using various algorithms, including linear regression, neural networks, random forests, and support vector machines. Empirical results show that the CPI compares favorably to alternative variable importance measures and other nonparametric tests of conditional independence on a diverse array of real and synthetic datasets. Simulations confirm that our inference procedures successfully control Type I error with competitive power in a range of settings. Our method has been implemented in an R package, cpi, which can be downloaded from https://github.com/dswatson/cpi.
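For orientation, the following is a minimal usage sketch of the cpi package. The interface shown (a cpi() function taking an mlr3 task, learner, and resampling scheme) is assumed from the package documentation and may differ across versions; the mtcars task is used only as a stand-in dataset.

    # Minimal usage sketch of the cpi package (interface assumed from the
    # package documentation; argument names may differ across versions).
    library(cpi)
    library(mlr3)
    library(mlr3learners)  # provides the ranger learner used below

    set.seed(42)
    # Estimate the CPI of every feature in a regression task with a random
    # forest learner, holdout resampling, and the default knockoff sampler.
    res <- cpi(
      task       = tsk("mtcars"),
      learner    = lrn("regr.ranger"),
      resampling = rsmp("holdout")
    )
    res  # per-feature CPI estimates with standard errors and p-values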

Highlights

  • Variable importance (VI) is a major topic in statistics and machine learning

  • The intuition behind the method is that if Xj does not significantly outperform its knockoff copy X̃j by some relevant importance measure, the original feature may be safely removed from the final model (see the sketch after this list)

  • We found significant effects at α = 0.05 for the average number of rooms, percentage of lower status of the population, pupil-teacher ratio, and several other variables with both the linear model (LM) and the support vector machine (SVM), which is in line with previous analyses (Friedman & Popescu, 2008; Williamson et al., 2021)
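To make the knockoff comparison concrete, here is an illustrative sketch (hypothetical simulation code, not the cpi package itself): a model is fit on the original features, one feature is then swapped for a second-order Gaussian knockoff from the knockoff package, and a one-sided paired t-test checks whether the per-observation prediction loss increases.

    # Illustrative knockoff comparison on simulated data (hypothetical example).
    library(knockoff)  # create.second_order() builds Gaussian knockoffs

    set.seed(1)
    n <- 500
    X <- matrix(rnorm(n * 3), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
    y <- X[, "x1"] + rnorm(n)          # only x1 truly contributes to y

    train <- 1:250
    test  <- 251:500
    fit <- lm(y ~ ., data = data.frame(X, y = y)[train, ])

    # Swap x1 for its knockoff copy and compare squared-error losses on test data.
    X_swap <- X
    X_swap[, "x1"] <- create.second_order(X, method = "equi")[, "x1"]

    loss_orig  <- (y[test] - predict(fit, data.frame(X)[test, ]))^2
    loss_knock <- (y[test] - predict(fit, data.frame(X_swap)[test, ]))^2

    # If x1 carries no conditional information, the loss difference should be
    # centred at zero; a significant increase indicates x1 should be kept.
    t.test(loss_knock, loss_orig, paired = TRUE, alternative = "greater")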



Introduction

Variable importance (VI) is a major topic in statistics and machine learning. It is the basis of most if not all feature selection methods, which analysts use to identify key drivers of variation in an outcome of interest and/or create more parsimonious models (Guyon & Elisseeff, 2003; Kuhn & Johnson, 2019; Meinshausen & Bühlmann, 2010). One fundamental difference between various importance measures is whether they test the marginal or conditional independence of features. To evaluate a response variable Y's marginal dependence on a predictor Xj, we test against the null hypothesis H0m: Xj ⊥ Y. A measure of conditional dependence, on the other hand, tests against a different null hypothesis, H0c: Xj ⊥ Y | X−j, where X−j denotes the set of all covariates except Xj.
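As a toy illustration of this distinction (hypothetical simulation, not taken from the paper), a predictor x2 that merely tracks x1 is marginally associated with y but conditionally independent of y given x1:

    # x2 is correlated with x1 but has no direct effect on y.
    set.seed(2)
    n  <- 10000
    x1 <- rnorm(n)
    x2 <- x1 + rnorm(n)
    y  <- x1 + rnorm(n)

    cor.test(x2, y)$p.value                # tiny p-value: the marginal null H0m is rejected
    summary(lm(y ~ x1 + x2))$coefficients  # x2 coefficient near zero: consistent with H0c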
