Abstract

We have developed a tool for model space exploration and variable selection in linear regression models based on a simple spike and slab model (Dey, 2012). The selected model is the one with the minimum final prediction error (FPE) among all candidate models. This is implemented via the R package modelSampler. Model selection based on the FPE criterion is questionable, however, because FPE can be sensitive to perturbations of the data; this R package can be used to assess the stability of the FPE criterion empirically. Stable model selection is accomplished by a bootstrap wrapper that calls the primary function of the package repeatedly on bootstrapped data. The heart of the method is model averaging, which yields stable variable selection and reveals the behavior of variables over the entire model space, a concept invaluable in high-dimensional settings.
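modelSampler itself is an R package; the following is a minimal, language-agnostic sketch in Python of the bootstrap-wrapper idea described above, not the package's actual API. All function names here are hypothetical, and FPE-based forward selection stands in for the package's primary selection routine: the selector is rerun on each bootstrap resample and the 0/1 inclusion indicators are averaged to measure how stable each variable's selection is.

```python
import numpy as np

def fpe(rss, n, k):
    # Akaike's final prediction error: (RSS / n) * (n + k) / (n - k)
    return (rss / n) * (n + k) / (n - k)

def forward_select_fpe(X, y):
    # Greedy forward selection: keep adding the variable that most
    # reduces FPE, and stop when no addition improves it.
    X = X - X.mean(axis=0)
    y = y - y.mean()
    n, p = X.shape
    selected = []
    best = fpe(np.sum(y ** 2), n, 0)
    improved = True
    while improved and len(selected) < p:
        improved, best_j = False, None
        for j in (j for j in range(p) if j not in selected):
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            crit = fpe(rss, n, len(cols))
            if crit < best:
                best, best_j, improved = crit, j, True
        if improved:
            selected.append(best_j)
    return selected

def bootstrap_inclusion_freq(X, y, B=200, seed=0):
    # Bootstrap wrapper: rerun the FPE-based selector on resampled
    # data and average the inclusion indicators over the B resamples.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)
        for j in forward_select_fpe(X[idx], y[idx]):
            counts[j] += 1
    return counts / B
```

Variables with inclusion frequencies near 1 are stably selected; noise variables that the FPE criterion picks up only on some resamples show intermediate frequencies, which is exactly the instability the bootstrap wrapper is meant to expose.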

Highlights

  • Variable selection in linear regression models is an important aspect of many scientific analyses

  • For comparison purposes we have considered three other methods: Random Forest, Boosting, and Bayesian Model Averaging (BMA); the first two are frequentist methods, while BMA is based on Bayesian methodology

  • Note that of these four methods only rescaled spike and slab (RSS) and BMA perform variable selection, so their out-of-bag (OOB) prediction error (PE) computations are always based on a subset of variables, whereas the Random Forest (RF) and Boosting methods use all variables for PE computation


Summary

Introduction

Variable selection in linear regression models is an important aspect of many scientific analyses. Note that unlike traditional BMA, where the goal is prediction (Hoeting et al., 1999), our ensemble is derived solely for purposes of variable selection. This type of analysis is very different from the linear regression implementation via the bicreg function of the R package BMA (Raftery et al., 2010) for Bayesian model averaging. The unique feature of the bimodal prior in the RSS model (details discussed later) is that it creates a unique mapping between a posterior sample and a visited model (for details see the Gibbs sampler in the Appendix), which makes FPE-based variable selection possible. The R package produces high-dimensional graphics to visualize several salient features of the variable selection procedure, such as the importance of variables relative to the total number of variables in the data set, the entire model space, the instability of the FPE criterion, and prediction error plots.
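To make the mapping between a posterior sample and a visited model concrete, here is a minimal stochastic search variable selection (SSVS) style Gibbs sampler with a bimodal (spike and slab) normal mixture prior. This is an illustrative Python sketch, not the package's sampler: the hyperparameters (spike scale tau0, slab scale tau1, prior inclusion weight w) are assumptions chosen for the demo. Each Gibbs draw of the indicator vector gamma identifies one visited model, and averaging the indicators gives posterior inclusion probabilities.

```python
import numpy as np

def spike_slab_gibbs(X, y, n_iter=2000, burn=500,
                     tau0=0.01, tau1=10.0, w=0.5, seed=0):
    """SSVS-style Gibbs sampler: beta_j ~ (1-g_j) N(0, tau0^2) + g_j N(0, tau1^2).
    Returns posterior inclusion probabilities for each column of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    gamma = np.ones(p, dtype=bool)   # start with all variables in the slab
    sigma2 = np.var(y)
    incl = np.zeros(p)
    for it in range(n_iter):
        # beta | gamma, sigma2 ~ N(m, V), V = (X'X/s2 + D^-1)^-1
        d = np.where(gamma, tau1 ** 2, tau0 ** 2)
        V = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / d))
        V = (V + V.T) / 2            # symmetrize against round-off
        m = V @ Xty / sigma2
        beta = rng.multivariate_normal(m, V)
        # gamma_j | beta_j : Bernoulli from slab vs spike densities
        log_slab = -0.5 * beta ** 2 / tau1 ** 2 - np.log(tau1)
        log_spike = -0.5 * beta ** 2 / tau0 ** 2 - np.log(tau0)
        pr = 1.0 / (1.0 + (1 - w) / w * np.exp(log_spike - log_slab))
        gamma = rng.random(p) < pr
        # sigma2 | beta : inverse-gamma update with a vague prior
        rss = np.sum((y - X @ beta) ** 2)
        sigma2 = 1.0 / rng.gamma(0.01 + n / 2, 1.0 / (0.01 + rss / 2))
        if it >= burn:
            incl += gamma            # each gamma draw = one visited model
    return incl / (n_iter - burn)
```

Because the spike is tight and the slab is diffuse, a coefficient far from zero is essentially never assigned to the spike, so signal variables have inclusion probabilities near 1 while noise variables are visited only occasionally.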

Organization of the Article
A Bimodal Spike and Slab Model
Variable Selection Based on modelSampler
Optimal Model Size Determination via Hard Shrinkage and Model Averaging
Summary
Example
Convergence of modelSampler
Diabetes Data
Icicle Plot
Out-of-bagging and the Best Subset of Variables
Empirical Study
Simulation Study
Real Data Application
Variable Stability and Model Space Revisited
Another Example of Variable Stability Plot
Discussion