Abstract

Many approaches to variable selection with multiply imputed (MI) data have been proposed for the development of prognostic models, but no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods for MI data: (I) model selection on bootstrap data, using backward elimination based on the AIC or the lasso, with the final model fitted on the most frequently selected variables (e.g., selected in ≥50% of all MI and bootstrap data sets); and (II) model selection on the original MI data, using the lasso. In approach II, the final model is obtained by (i) averaging the estimates of variables that were selected in any MI data set, or (ii) in at least 50% of the MI data sets; (iii) performing the lasso on the stacked MI data; or (iv) as in (iii), but with individual weights determined by each subject's fraction of missing values. In all lasso models, we used both the optimal penalty and the one-standard-error (1-se) rule. We also considered recalibrating the models to correct for overshrinkage due to a suboptimal penalty, by refitting either the linear predictor or all individual variables. We applied the methods to a real data set of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, lasso selection with the 1-se penalty showed the best performance, in both approach I and approach II. Stacking the MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
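To make the stacking in (iii) and (iv) concrete, the sketch below stacks M completed data sets and fits a single weighted L1-penalized logistic regression on them. It uses synthetic data, scikit-learn's IterativeImputer as a simple stand-in for a full MI procedure, and the weighting w_i = (1 - f_i)/M (with f_i the subject's fraction of missing values), which is one common choice in the stacked-MI literature; these details are illustrative assumptions, not the exact specification used in the study.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Toy data with ~15% missingness (a stand-in for the real patient data).
    n, p = 200, 8
    X_full = rng.normal(size=(n, p))
    y = (X_full[:, 0] + 0.5 * X_full[:, 1] + rng.normal(size=n) > 0).astype(int)
    mask = rng.random((n, p)) < 0.15
    X_miss = np.where(mask, np.nan, X_full)
    f_miss = mask.mean(axis=1)  # per-subject fraction of missing values

    # M "imputations": IterativeImputer with posterior sampling, varied seeds.
    M = 5
    imputed = [
        IterativeImputer(sample_posterior=True, random_state=m).fit_transform(X_miss)
        for m in range(M)
    ]

    # Stack the completed data sets; weight rows so each subject contributes
    # at most one observation in total, down-weighted by its missingness:
    # w_i = (1 - f_i) / M (an assumed weighting rule, see the lead-in above).
    X_stack = np.vstack(imputed)
    y_stack = np.tile(y, M)
    w = np.tile((1.0 - f_miss) / M, M)

    # Weighted lasso (L1-penalized) logistic regression on the stacked data;
    # in practice C would be chosen by cross-validation (optimal or 1-se).
    lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    lasso.fit(X_stack, y_stack, sample_weight=w)
    print("selected predictors:", np.flatnonzero(lasso.coef_.ravel()))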

Highlights

  • When creating a prediction model, the aim is to build a clinically useful model with satisfactory predictive performance.

  • In the model selection on original MI data approach, models obtained with the optimal penalty outperform those obtained with the 1-se penalty without recalibration, except for the AUC with five events per variable (EPV) and 20% missing values (the sketch after these highlights shows how the two penalties are chosen).

  • In the scenario with the largest amount of information (4% missing values and EPV 15), the 1-se approach even yields models with AUC and Brier score comparable to those obtained with the optimal penalty.
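As a concrete illustration of the two tuning choices compared above, the following sketch selects both the optimal penalty (best mean cross-validated score) and the 1-se penalty (the strongest penalty whose mean score lies within one standard error of the best) for an L1-penalized logistic regression. The data, grid, and use of scikit-learn are assumptions for the example, not the software or settings of the study.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    Cs = np.logspace(-3, 2, 25)  # candidate inverse penalties (large C = weak penalty)
    cv = LogisticRegressionCV(Cs=Cs, cv=10, penalty="l1", solver="liblinear",
                              scoring="neg_log_loss").fit(X, y)

    scores = cv.scores_[1]       # (n_folds, n_Cs) CV scores; higher is better
    mean = scores.mean(axis=0)
    se = scores.std(axis=0, ddof=1) / np.sqrt(scores.shape[0])

    best = mean.argmax()         # optimal penalty: best mean CV score
    # 1-se rule: strongest penalty (smallest C) within one SE of the best score
    C_1se = Cs[mean >= mean[best] - se[best]].min()

    model_1se = LogisticRegression(penalty="l1", solver="liblinear", C=C_1se).fit(X, y)
    print(f"optimal C = {Cs[best]:.4g}, 1-se C = {C_1se:.4g}")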


Introduction

When creating a prediction model, the aim is to build a clinically useful model with satisfactory predictive performance. Variables that are difficult or costly to measure, unreliable, or unavailable at prediction time are less likely to increase the usability of a prediction model, even if their causal relation to the outcome is strong. Parsimony is therefore a desirable property in predictive modeling: a complex model is often more difficult to understand and communicate. Subject matter knowledge and expert opinion should be the most important rationale for selecting a variable in a prediction model (Harrell, 2015). However, this information is not always available, and expert opinion might introduce bias.
