Adequate sample size for developing prediction models is not simply related to events per variable

Emmanuel O Ogundimu, Douglas G Altman, Gary S Collins

doi:10.1016/j.jclinepi.2016.02.031


Open Access

https://doi.org/10.1016/j.jclinepi.2016.02.031


Abstract

Objectives: The choice of an adequate sample size for a Cox regression analysis is generally based on the rule of thumb derived from simulation studies of a minimum of 10 events per variable (EPV). One simulation study suggested scenarios in which the 10 EPV rule can be relaxed. The effect of a range of binary predictors with varying prevalence, reflecting clinical practice, has not yet been fully investigated.

Study Design and Setting: We conducted an extended resampling study using a large general-practice data set, comprising over 2 million anonymized patient records, to examine the EPV requirements for prediction models with low-prevalence binary predictors developed using Cox regression. The performance of the models was then evaluated using an independent external validation data set. We investigated both fully specified models and models derived using variable selection.

Results: Our results indicated that an EPV rule of thumb should be data driven and that EPV ≥ 20 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model.

Conclusion: Higher EPV is needed when low-prevalence predictors are present in a model to eliminate bias in regression coefficients and improve predictive accuracy.

Highlights

When multivariable prediction models are developed, the sample size is often based on the ratio of the number of individuals with the outcome event to the number of candidate predictors, referred to as the events per variable (EPV)
Our results indicated that an EPV rule of thumb should be data driven and that EPV ≥ 20 generally eliminates bias in regression coefficients when many low-prevalence predictors are included in a Cox model
Higher EPV is needed when low-prevalence predictors are present in a model to eliminate bias in regression coefficients and improve predictive accuracy

Summary

Introduction

When multivariable prediction models are developed, the sample size is often based on the ratio of the number of individuals with the outcome event to the number of candidate predictors (more precisely, the number of parameters), referred to as the events per variable (EPV). Models developed from data sets with too few outcome events relative to the number of candidate predictors are likely to yield biased estimates of regression coefficients. They lead to unstable prediction models that are overfit to the development sample and perform poorly on new data.
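The EPV ratio defined above is simple arithmetic, and a minimal sketch may help make it concrete. The function names and the example counts (150 events, 12 candidate parameters) are hypothetical and are not taken from the study; only the thresholds of 10 and 20 come from the abstract. The calculation illustrates the ratio only; it says nothing about predictor prevalence, which the abstract identifies as a reason EPV alone may not be enough.

```python
import math


def events_per_variable(n_events: int, n_parameters: int) -> float:
    """Return EPV: outcome events divided by candidate predictor parameters."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


def events_needed(target_epv: float, n_parameters: int) -> int:
    """Return the smallest number of events that reaches target_epv."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return math.ceil(target_epv * n_parameters)


# Hypothetical development data set: 150 events, 12 candidate parameters.
epv = events_per_variable(n_events=150, n_parameters=12)
print(f"EPV = {epv}")  # EPV = 12.5

# Events required under the conventional rule (10) and the higher
# threshold reported in the abstract (20), for the same 12 parameters.
print(events_needed(10, 12))  # 120
print(events_needed(20, 12))  # 240
```

In this hypothetical case the data set would satisfy the 10 EPV rule but fall short of EPV ≥ 20, which would require 240 events for 12 parameters.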



Journal: Journal of Clinical Epidemiology
Publication Date: Mar 8, 2016
Citations: 316
License type: CC BY


Similar Papers

A simulation study of sample size demonstrated the importance of the number of events per variable to develop prediction models in clustered data
L Wynants ... Y Vergouwe
Journal of Clinical Epidemiology | VOL. 68 | 13 Feb 2015

Individual patients are the primary source and the target of clinical research
J André Knottnerus ... Andrea C Tricco
Journal of Clinical Epidemiology | VOL. 76 | 01 Aug 2016

Logistic regression modeling and the number of events per variable: selection bias dominates
Ewout W Steyerberg ... Frank E Harrell
Journal of Clinical Epidemiology | VOL. 64 | 25 Oct 2011

Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure
Delphine S Courvoisier ... Thomas V Perneger
Journal of Clinical Epidemiology | VOL. 64 | 25 Oct 2011
