Abstract

Background

When developing risk models for binary data with small or sparse data sets, standard maximum likelihood estimation (MLE) based logistic regression faces several problems, including biased or infinite estimates of the regression coefficients and frequent convergence failure of the likelihood due to separation. Separation occurs commonly even when the sample size is large, provided there is a sufficient number of strong predictors. In the presence of separation, even if a model can be fitted, it is overfitted and has poor predictive performance. Firth- and logF-type penalized regression methods are popular alternatives to MLE, particularly for solving the separation problem. Despite their attractive advantages, their use in risk prediction is very limited. This paper evaluates these methods for risk prediction in comparison with MLE and other commonly used penalized methods such as ridge regression.

Methods

The predictive performance of the methods was evaluated by assessing calibration, discrimination and overall predictive performance in an extensive simulation study. The methods are further illustrated using a real data example with a low prevalence of the outcome.

Results

MLE showed poor predictive performance in small or sparse data sets. All penalized methods offered some improvement in calibration, discrimination and overall predictive performance. Although the Firth- and logF-type methods showed almost equal improvement, Firth-type penalization produced some bias in the average predicted probability, and the amount of bias was even larger than that produced by MLE. Of the logF(1,1) and logF(2,2) penalizations, logF(2,2) produced a slight bias in the estimated regression coefficient of a binary predictor, while logF(1,1) performed better in all respects. Similarly, ridge performed well in discrimination and overall predictive performance, but it often produced underfitted models and had a high rate of convergence failure (even higher than that of MLE), probably due to the separation problem.

Conclusions

The logF-type penalized method, particularly logF(1,1), could be used in practice when developing risk models for small or sparse data sets.
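As a concrete sketch of the Firth-type penalization compared above: Firth's bias reduction replaces the usual score equation by a modified score that adds a hat-matrix (leverage) correction, which keeps the estimates finite even under complete separation. The following minimal numpy implementation is our own illustration, not the authors' code; the data are a hypothetical completely separated toy example.

```python
import numpy as np

def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Firth-type penalized logistic regression via Newton-Raphson.

    The usual score is replaced by the Firth-modified score
        U*_j(beta) = sum_i (y_i - p_i + h_i * (1/2 - p_i)) * x_ij,
    where h_i are the diagonals of the weighted hat matrix.  The
    correction keeps the estimates finite even under separation.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)                              # IRLS weights
        XtWX_inv = np.linalg.inv(X.T @ (W[:, None] * X))
        # leverages: h_i = w_i * x_i' (X'WX)^{-1} x_i
        h = W * np.einsum('ij,jk,ik->i', X, XtWX_inv, X)
        step = XtWX_inv @ (X.T @ (y - p + h * (0.5 - p)))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Completely separated toy data: the binary predictor equals the outcome,
# so ordinary MLE diverges, but the Firth estimates remain finite.
X = np.column_stack([np.ones(8), np.repeat([0.0, 1.0], 4)])
y = np.repeat([0.0, 1.0], 4)
beta_hat = firth_logistic(X, y)
```

For a single binary predictor, Firth's estimate coincides with adding 1/2 to every cell of the 2x2 table, so in this toy example the slope is log(4.5^2/0.5^2) = log 81, about 4.39, rather than infinity.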

Highlights

  • When developing risk models for binary data with small or sparse data sets, standard maximum likelihood estimation (MLE) based logistic regression faces several problems, including biased or infinite estimates of the regression coefficients and frequent convergence failure of the likelihood due to separation

  • The requirement of a minimum events per variable (EPV) is often difficult to meet when risk models are developed for low-dimensional data with a rare outcome or a small to moderate sample size, and for high-dimensional data where the number of predictors usually exceeds the number of observations

  • Example data (stress echocardiography): the dataset used for simulation and illustration is in the public domain and was originally extracted from the study by Krivokapich et al. [27], whose aim was to quantify the prognostic value of dobutamine stress echocardiography (DSE) in predicting cardiac events in 558 patients with known or suspected coronary artery disease
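The separation problem behind these points can be reproduced in a few lines: with a completely separated binary predictor, the Newton-Raphson (IRLS) iterations used for logistic MLE never settle, and the slope estimate grows without bound. A minimal numpy sketch with hypothetical toy data (not the DSE data):

```python
import numpy as np

# Toy data with complete separation: the predictor perfectly splits the outcome.
X = np.column_stack([np.ones(8), np.repeat([0.0, 1.0], 4)])  # intercept + binary x
y = np.repeat([0.0, 1.0], 4)                                  # y == x -> separation

beta = np.zeros(2)
for _ in range(10):                 # plain Newton-Raphson (IRLS) for logistic MLE
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    W = p * (1.0 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
# beta[1] keeps increasing with every extra iteration instead of converging
```

Each additional iteration pushes the slope roughly two more units toward infinity, because the likelihood has no finite maximum under separation.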


Introduction

When developing risk models for binary data with small or sparse data sets, standard maximum likelihood estimation (MLE) based logistic regression faces several problems, including biased or infinite estimates of the regression coefficients and frequent convergence failure of the likelihood due to separation. In many areas of clinical research, risk models for binary data are developed in the maximum-likelihood (ML) based logistic regression framework to predict the risk of a patient's future health status, such as death or illness [1, 2]. In cardiology, for example, models may be developed to predict the risk of having cardiovascular disease; predictions based on these models are useful. However, the requirement of a minimum events per variable (EPV) is often difficult to meet when risk models are developed for low-dimensional data with a rare outcome or a small to moderate sample size, and for high-dimensional data where the number of predictors usually exceeds the number of observations.
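The logF(1,1) penalization evaluated in this paper can be imposed without special software by data augmentation, as described by Greenland and Mansournia: for each non-intercept coefficient, two pseudo-observations with that covariate set to 1 and all other columns (including the intercept) set to 0 are added, one with outcome 0 and one with outcome 1, each carrying weight 1/2; ordinary weighted ML logistic regression on the augmented data then yields the penalized fit. The numpy sketch below is our own illustration, not the authors' code.

```python
import numpy as np

def wlogit_mle(X, y, w, max_iter=200, tol=1e-8):
    """Weighted maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (w * (y - p))
        H = X.T @ ((w * p * (1.0 - p))[:, None] * X)   # Fisher information
        step = np.linalg.solve(H, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def logf11_fit(X, y):
    """logF(1,1) penalization via data augmentation.

    Column 0 of X is assumed to be the intercept (left unpenalized).
    For each remaining coefficient, append two pseudo-rows with that
    covariate equal to 1 and everything else 0, outcomes 0 and 1,
    each with weight 1/2, then run ordinary weighted ML.
    """
    n, k = X.shape
    rows, outs, wts = [X], [y.astype(float)], [np.ones(n)]
    for j in range(1, k):
        e = np.zeros(k)
        e[j] = 1.0
        rows.append(np.vstack([e, e]))
        outs.append(np.array([0.0, 1.0]))
        wts.append(np.array([0.5, 0.5]))
    return wlogit_mle(np.vstack(rows), np.concatenate(outs), np.concatenate(wts))

# Completely separated toy data; the logF(1,1)-penalized slope stays finite.
X = np.column_stack([np.ones(8), np.repeat([0.0, 1.0], 4)])
y = np.repeat([0.0, 1.0], 4)
beta_pen = logf11_fit(X, y)
```

The pseudo-rows act like a prior centred at zero on each penalized coefficient, so the slope remains finite on this separated toy example even though the unpenalized MLE does not exist.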

