Abstract

In deriving a regression model analysts often have to use variable selection, despite the problems introduced by data-dependent model building. Resampling approaches have been proposed to handle some of the critical issues. In order to assess and compare several strategies, we conduct a simulation study with 15 predictors and a complex correlation structure in the linear regression model. Using sample sizes of 100 and 400 and estimates of the residual variance corresponding to R2 of 0.50 and 0.71, we consider 4 scenarios with varying amounts of information. We also consider two examples with 24 and 13 predictors, respectively. We discuss the value of cross-validation, shrinkage and backward elimination (BE) with varying significance levels. We assess whether 2-step approaches using global or parameterwise shrinkage factors (PWSF) can improve selected models and compare the results to models derived with the LASSO procedure. Besides the MSE, we use model sparsity and further criteria for model assessment. The amount of information in the data influences the selected models and the comparison of the procedures. None of the approaches was best in all scenarios. The performance of backward elimination with a suitably chosen significance level was not worse than that of the LASSO, and the models selected by BE were much sparser, an important advantage for interpretation and transportability. Compared to global shrinkage, PWSF performed better. Provided that the amount of information is not too small, we conclude that BE followed by PWSF is a suitable approach when variable selection is a key part of data analysis.

Highlights

  • In deriving a suitable regression model analysts are often faced with many predictors which may have an influence on the outcome

  • Provided that the amount of information is not too small, we conclude that backward elimination (BE) followed by parameterwise shrinkage factor (PWSF) is a suitable approach when variable selection is a key part of data analysis

  • Concerning prediction error, it seems acceptable for noisy data, but it is beaten by variable selection followed by some form of shrinkage if the data are less noisy


Introduction

In deriving a suitable regression model analysts are often faced with many predictors which may have an influence on the outcome. By minimizing the residual sum of squares under a constraint on the coefficients, the LASSO combines variable selection with shrinkage. It can be regarded, in a wider sense, as a generalization of an approach by [2], who propose to improve predictors with respect to the average prediction error by multiplying the estimated effect of each covariate with a constant, an estimated shrinkage factor. As the bias caused by variable selection usually differs between individual covariates, [4] extends their idea by proposing parameterwise shrinkage factors. The latter approach is intended as a post-estimation shrinkage procedure after selection of variables.
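The two shrinkage ideas above can be illustrated with a small sketch. The following code (a minimal illustration on simulated data, not the paper's simulation design; all variable names and the fold count are assumptions) estimates a global shrinkage factor by regressing the outcome on the cross-validated linear predictor, and parameterwise shrinkage factors by regressing the outcome jointly on the cross-validated per-covariate contributions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data (sizes chosen for illustration only)
n, p = 100, 5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.5, 0.0, 0.0, 0.25])
y = X @ beta_true + rng.normal(scale=2.0, size=n)

def ols(X, y):
    # Least-squares fit with intercept; returns (intercept, slopes)
    Z = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef[0], coef[1:]

def cv_shrinkage(X, y, k=5):
    """K-fold cross-validation estimates of a global shrinkage
    factor and of parameterwise shrinkage factors."""
    n, p = X.shape
    folds = np.array_split(rng.permutation(n), k)
    eta = np.zeros(n)          # out-of-fold linear predictor
    parts = np.zeros((n, p))   # out-of-fold contributions X_j * beta_j
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        a, b = ols(X[train], y[train])
        eta[test] = a + X[test] @ b
        parts[test] = X[test] * b
    # Global factor: regress y on the cross-validated linear predictor
    _, c_global = ols(eta.reshape(-1, 1), y)
    # Parameterwise factors: regress y jointly on the p contributions
    _, c_pw = ols(parts, y)
    return c_global[0], c_pw

c_global, c_pw = cv_shrinkage(X, y)
# Shrunken coefficients: multiply the full-model estimates elementwise
_, beta_full = ols(X, y)
beta_shrunk = c_pw * beta_full
```

In a 2-step approach as discussed in the paper, such factors would be computed after variable selection (e.g. after BE) and applied only to the selected covariates; here they are shown on the full model for brevity.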

Simulation Design
The Value of Cross-Validation
Global Shrinkage
Model Selection
Assessment of Prediction Error
Post-Selection Cross-Validation
Post-Selection Shrinkage
Comparison with LASSO
Ozone Data
Body Fat Data
Discussion and Conclusions
Cross-Validation and Shrinkage without Selection
Variable Selection and Post Selection Shrinkage
Comparison with LASSO and Similar Procedures
Findings
Directions for Future Research