Abstract

Risk prediction models are used to predict a clinical outcome for patients using a set of predictors. We focus on predicting low‐dimensional binary outcomes typically arising in epidemiology, health services and public health research where logistic regression is commonly used. When the number of events is small compared with the number of regression coefficients, model overfitting can be a serious problem. An overfitted model tends to demonstrate poor predictive accuracy when applied to new data. We review frequentist and Bayesian shrinkage methods that may alleviate overfitting by shrinking the regression coefficients towards zero (some methods can also provide more parsimonious models by omitting some predictors). We evaluated their predictive performance in comparison with maximum likelihood estimation using real and simulated data. The simulation study showed that maximum likelihood estimation tends to produce overfitted models with poor predictive performance in scenarios with few events, and penalised methods can offer improvement. Ridge regression performed well, except in scenarios with many noise predictors. Lasso performed better than ridge in scenarios with many noise predictors and worse in the presence of correlated predictors. Elastic net, a hybrid of the two, performed well in all scenarios. Adaptive lasso and smoothly clipped absolute deviation performed best in scenarios with many noise predictors; in other scenarios, their performance was inferior to that of ridge and lasso. Bayesian approaches performed well when the hyperparameters for the priors were chosen carefully. Their use may aid variable selection, and they can be easily extended to clustered‐data settings and to incorporate external information. © 2015 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.
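
For readers who want the penalties named in the abstract written out, the following is a sketch of the standard penalised log-likelihood for logistic regression. The notation (n patients, p predictors, tuning parameters λ and α) is assumed here for illustration and is not taken from the paper, and the elastic net is shown in one common parameterisation; software packages may scale the two terms differently.

```latex
% Log-likelihood of a logistic model with intercept \beta_0 and coefficients \beta_1, ..., \beta_p
\[
\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \eta_i - \log\left(1 + e^{\eta_i}\right) \right],
\qquad \eta_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}
\]

% Penalised estimation maximises \ell(\beta_0, \beta) - \lambda P(\beta) with \lambda \ge 0;
% \lambda = 0 recovers maximum likelihood estimation. The intercept is not penalised.
\[
P_{\mathrm{ridge}}(\beta) = \sum_{j=1}^{p} \beta_j^{2},
\qquad
P_{\mathrm{lasso}}(\beta) = \sum_{j=1}^{p} \lvert \beta_j \rvert
\]

% Elastic net: a weighted mix of the two penalties, with 0 \le \alpha \le 1
\[
P_{\mathrm{enet}}(\beta) = \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
  + (1 - \alpha) \sum_{j=1}^{p} \beta_j^{2}
\]
```

Larger values of λ pull the coefficients more strongly towards zero. The absolute-value term in the lasso and elastic net penalties can set some coefficients exactly to zero, which is how those methods omit predictors.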

Highlights

  • The usefulness of risk prediction models for informing patients and practitioners about the future course of a disease, guiding therapeutic strategies, aiding selection of patients for inclusion in randomised trials and in surveillance has been well established [1,2,3]

  • It is important to consider penalised methods when developing prediction models for low-dimensional data with few events. They can improve calibration and predictive accuracy compared with maximum likelihood estimation (MLE), although the improvement in discrimination was modest in the scenarios considered

  • When variable selection is required and no high correlations are observed between predictors, we suggest using lasso, while if there are high correlations, elastic net is the preferred option
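
The shrinkage effect described in the highlights can be seen on a toy example. The sketch below is a minimal illustration, not the authors' method or software: the twelve-patient data set is made up, plain gradient descent stands in for a proper fitting routine, and the function names (`fit_logistic`, `predict_risk`, `l2_norm`) and the penalty weight `lam=0.1` are hypothetical choices. It fits the same logistic model with and without a ridge penalty and compares the size of the estimated coefficients.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lam=0.0, lr=0.5, n_iter=5000):
    """Minimise mean negative log-likelihood + lam * sum(beta_j^2) by gradient descent.

    lam = 0 gives (approximately) the maximum likelihood fit.
    The intercept b0 is not penalised.
    """
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        g0, g = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            resid = sigmoid(b0 + sum(b * v for b, v in zip(beta, xi))) - yi
            g0 += resid
            for j in range(p):
                g[j] += resid * xi[j]
        b0 -= lr * g0 / n
        beta = [b - lr * (gj / n + 2.0 * lam * b) for b, gj in zip(beta, g)]
    return b0, beta


def predict_risk(b0, beta, x):
    # Predicted probability of the event for one new patient.
    return sigmoid(b0 + sum(b * v for b, v in zip(beta, x)))


def l2_norm(beta):
    return math.sqrt(sum(b * b for b in beta))


# Made-up data: 12 patients, 2 predictors, only 3 events (y = 1).
X = [[0.5, 1.2], [1.5, -0.3], [-0.2, 0.8], [0.7, 0.5], [-1.0, 0.4], [0.3, -1.1],
     [-0.7, -0.6], [1.1, 0.9], [-1.4, 0.2], [0.0, -0.5], [-0.4, 1.0], [0.6, -0.9]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

mle_b0, mle_beta = fit_logistic(X, y, lam=0.0)      # no penalty
ridge_b0, ridge_beta = fit_logistic(X, y, lam=0.1)  # ridge penalty

print("MLE coefficients:  ", mle_beta, "norm", l2_norm(mle_beta))
print("Ridge coefficients:", ridge_beta, "norm", l2_norm(ridge_beta))

new_patient = [1.0, 1.0]  # hypothetical predictor values
print("MLE risk:  ", predict_risk(mle_b0, mle_beta, new_patient))
print("Ridge risk:", predict_risk(ridge_b0, ridge_beta, new_patient))
```

In practice the penalty weight would be chosen by cross-validation rather than fixed in advance, and an established package would replace the hand-written optimiser.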


Summary

Introduction

The usefulness of risk prediction models for informing patients and practitioners about the future course of a disease, guiding therapeutic strategies, aiding selection of patients for inclusion in randomised trials and in surveillance has been well established [1,2,3]. A risk prediction model is developed using a regression model that relates the outcome to patient characteristics (the predictor variables); logistic regression is commonly used. The regression model is fitted to the data at hand (the training or development data set) to estimate the regression coefficients, and these estimated coefficients can then be used to predict the outcome in new patients. However, a risk model that performs well on the training data set may not perform well when it is applied to new data. Risk models that are commonly used in practice such as the ‘QRISK-2’ and the ‘Framingham’ calculator for the risk of cardiovascular disease [4, 5] and the ‘HCM-SCD calculator’ for the risk of sudden cardiac death in patients with hypertrophic …

a Department of Statistical Science, University College London, London WC1E 6BT, U.K.
b MRC Biostatistics Unit, Cambridge CB2 0SR, U.K.
