Abstract

Background: Selecting which variables to include in multiple regression models is a pervasive problem in medical research.

Objectives: Based on questionnaire data (n = 18538, 69.9% men) from the Norwegian Opioid Maintenance Treatment Program, this study aims to compare the performance of different variable selection methods and the potential clinical consequences of choice of method. The effect of missing data is also explored.

Methods: The dependent variable was engagement in criminal behavior while in treatment. Twenty-nine potential covariates on demographics, psychosocial factors and drug use were tested for inclusion in a multiple logistic regression model. Both complete case and multiply imputed data were considered. We compared the results from variable selection methods ranging from expert-based and purposeful variable selection, through stepwise methods, to more recently developed penalized regression using the Least Absolute Shrinkage and Selection Operator (LASSO).

Results: The various variable selection methods resulted in regression models including from 9 to 22 covariates. The stepwise selection procedures generated the models with the most covariates included. The choice of variable selection method directly affected the estimated regression coefficients, both in effect size and statistical significance. For several variables the expert-based approach disagreed with all data-driven methods.

Conclusions: The choice of variable selection method may strongly affect the resulting regression model, along with accompanying effect sizes and confidence intervals. This may affect clinical conclusions. The process should consequently be given sufficient consideration in model building. We recommend combining expert knowledge with a data-driven variable selection method to explore the models’ robustness.
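To illustrate how penalized regression with the LASSO performs variable selection in a logistic model, the following is a minimal Python sketch. It rests on several assumptions: the data are synthetic (not the study's questionnaire data), the number of candidate covariates is set to 29 only to mirror the abstract, the penalty strength `lam` is picked by hand rather than by cross-validation, and the fit uses a plain proximal gradient loop instead of whatever software the authors used. The names `soft_threshold`, `lasso_logistic` and `selected` are hypothetical.

```python
import numpy as np

def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrinks toward zero and sets small values to exactly zero.
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def lasso_logistic(X, y, lam, n_iter=2000):
    """L1-penalized logistic regression fitted by proximal gradient descent.

    Minimizes mean log-loss + lam * ||w||_1; the intercept b is not penalized.
    """
    n, p = X.shape
    w = np.zeros(p)
    b = 0.0
    # Step size 1/L, where L bounds the curvature of the mean log-loss
    # (computed on the design matrix augmented with an intercept column).
    X_aug = np.hstack([X, np.ones((n, 1))])
    step = 4.0 * n / (np.linalg.norm(X_aug, 2) ** 2)
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad_w = X.T @ (prob - y) / n
        grad_b = np.mean(prob - y)
        w = soft_threshold(w - step * grad_w, step * lam)
        b = b - step * grad_b
    return w, b

# Synthetic example (assumption): 29 candidate covariates, only the first 5 truly matter.
rng = np.random.default_rng(0)
n, p = 2000, 29
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize so the penalty treats covariates equally
true_beta = np.zeros(p)
true_beta[:5] = [1.0, -1.0, 0.8, -0.8, 0.6]
true_prob = 1.0 / (1.0 + np.exp(-(X @ true_beta - 0.5)))
y = (rng.random(n) < true_prob).astype(float)

w, b = lasso_logistic(X, y, lam=0.05)
selected = np.flatnonzero(w != 0.0)  # covariates kept by the LASSO
print("selected covariate indices:", selected.tolist())
```

Covariates whose coefficients are shrunk to exactly zero are dropped from the model, which is what distinguishes the LASSO from stepwise procedures that add or remove variables based on significance tests. In practice the penalty strength would usually be tuned, for example by cross-validation, and the retained set can change with that choice.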
