Abstract
We have developed a tool for model space exploration and variable selection in linear regression models based on a simple spike and slab model (Dey, 2012). The chosen model is the one with the minimum final prediction error (FPE) among all candidate models. This is implemented via the R package modelSampler. However, model selection based on the FPE criterion is questionable, because FPE can be sensitive to perturbations of the data. The package can therefore be used for empirical assessment of the stability of FPE-based selection. Stable model selection is accomplished by a bootstrap wrapper that calls the primary function of the package repeatedly on bootstrapped data. The heart of the method is model averaging, used both for stable variable selection and to study the behavior of variables over the entire model space, a concept invaluable in high-dimensional situations.
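The bootstrap wrapper described above can be illustrated with a small conceptual sketch. This is not the modelSampler implementation (which uses a spike and slab sampler in R); it is a self-contained Python toy that selects the subset of regressors minimizing Akaike's FPE by exhaustive search, then repeats that selection on bootstrap resamples to expose the instability of the FPE criterion. All variable names and the simulated data are illustrative assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def fpe(X, y):
    """Akaike's final prediction error for an OLS fit: MSE * (n + k) / (n - k)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return (resid @ resid / n) * (n + k) / (n - k)

def best_subset_by_fpe(X, y):
    """Return the variable subset minimizing FPE (exhaustive search)."""
    p = X.shape[1]
    best, best_val = None, np.inf
    for r in range(1, p + 1):
        for subset in itertools.combinations(range(p), r):
            val = fpe(X[:, subset], y)
            if val < best_val:
                best, best_val = subset, val
    return best

# Toy data: only x0 and x1 carry signal; x2..x4 are pure noise.
n, p = 60, 5
X = rng.standard_normal((n, p))
y = 1.5 * X[:, 0] + 1.0 * X[:, 1] + rng.standard_normal(n)

# Bootstrap wrapper: redo FPE-best selection on resampled data and count
# how often each variable is chosen (its bootstrap inclusion frequency).
B = 100
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, n)
    for j in best_subset_by_fpe(X[idx], y[idx]):
        counts[j] += 1
print(counts / B)  # signal variables selected nearly always; noise variables unstable
```

The inclusion frequencies across bootstrap replicates are exactly the kind of model-averaged stability summary the package is built around: a variable that appears only in a minority of resampled FPE-best models is a fragile selection.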
Highlights
Variable selection in linear regression models is an important aspect of many scientific analyses
For comparison purposes we have considered three other methods: Random Forest, Boosting, and Bayesian Model Averaging (BMA); the first two are frequentist methods, while BMA is based on Bayesian methodology
Note that of these four methods (the three above plus our rescaled spike and slab (RSS) method), only RSS and BMA perform variable selection, so their out-of-bag (OOB) prediction error (PE) computations are always based on a subset of variables, whereas Random Forest (RF) and Boosting use all variables for PE computation
Summary
Variable selection in linear regression models is an important aspect of many scientific analyses. Note that unlike traditional BMA, where the goal is prediction (Hoeting et al., 1999), our ensemble is derived solely for the purpose of variable selection. This type of analysis is very different from the linear regression implementation via the bicreg function of the R package BMA (Raftery et al., 2010) for Bayesian model averaging. The unique feature of the bimodal prior in the RSS model (details discussed later) is that it creates a unique mapping between each posterior sample and a visited model (for details see the Gibbs sampler in the Appendix). This mapping is what makes FPE-based variable selection possible. The package produces high-dimensional graphics to visualize several salient features of the variable selection procedure, such as the importance of variables relative to the total number of variables in the data set, the entire model space, the instability of the FPE criterion, and prediction error plots
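The mapping from a posterior sample to a visited model can be sketched with a minimal spike and slab Gibbs sampler in the style of stochastic search variable selection. This is a conceptual illustration, not the RSS sampler from the Appendix: here each coefficient has a latent indicator choosing between a narrow spike and a wide slab normal prior, and each Gibbs draw of the indicator vector is, by construction, one visited model. Hyperparameter values and the simulated data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: true coefficients (2, 0, 0, -1.5) -- x0 and x3 carry signal.
n, p = 80, 4
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, 0.0, -1.5]) + rng.standard_normal(n)

# Illustrative spike/slab standard deviations and (fixed) noise variance.
tau0, tau1, sigma2 = 0.05, 5.0, 1.0

def norm_pdf(x, sd):
    return np.exp(-0.5 * (x / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

gamma = np.ones(p, dtype=int)   # start from the full model
inclusion = np.zeros(p)
visits = {}
XtX, Xty = X.T @ X, X.T @ y
burn, keep = 500, 1500

for it in range(burn + keep):
    # Draw beta | gamma: Gaussian with precision X'X / sigma2 + D^{-1},
    # where D holds the prior variance implied by each indicator.
    d = np.where(gamma == 1, tau1, tau0) ** 2
    cov = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / d))
    beta = rng.multivariate_normal(cov @ (Xty / sigma2), cov)
    # Draw gamma_j | beta_j: which mixture component explains beta_j better?
    p_slab, p_spike = norm_pdf(beta, tau1), norm_pdf(beta, tau0)
    gamma = (rng.random(p) < p_slab / (p_slab + p_spike)).astype(int)
    if it >= burn:
        inclusion += gamma
        key = tuple(gamma)               # each posterior draw maps to one model
        visits[key] = visits.get(key, 0) + 1

print(inclusion / keep)                  # posterior inclusion frequencies
print(max(visits, key=visits.get))       # most frequently visited model
```

Because every posterior draw corresponds to exactly one indicator vector, tabulating the draws gives both per-variable inclusion frequencies and a ranking of visited models, which is the ensemble summary the variable selection graphics are built from.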