Abstract

Background
Stability of risk estimates from prediction models may be highly dependent on the sample size of the dataset available for model derivation. In this paper, we evaluate the stability of cardiovascular disease risk scores for individual patients when using different sample sizes for model derivation; such sample sizes include those similar to models recommended in the national guidelines, and those based on recently published sample size formulae for prediction models.

Methods
We mimicked the process of sampling N patients from a population to develop a risk prediction model by sampling patients from the Clinical Practice Research Datalink. A cardiovascular disease risk prediction model was developed on this sample and used to generate risk scores for an independent cohort of patients. This process was repeated 1000 times, giving a distribution of risks for each patient. N = 100,000, 50,000, 10,000, Nmin (derived from sample size formula) and Nepv10 (meets 10 events per predictor rule) were considered. The 5–95th percentile range of risks across these models was used to evaluate instability. Patients were grouped by a risk derived from a model developed on the entire population (population-derived risk) to summarise results.

Results
For a sample size of 100,000, the median 5–95th percentile range of risks for patients across the 1000 models was 0.77%, 1.60%, 2.42% and 3.22% for patients with population-derived risks of 4–5%, 9–10%, 14–15% and 19–20% respectively; for N = 10,000, it was 2.49%, 5.23%, 7.92% and 10.59%, and for N using the formula-derived sample size, it was 6.79%, 14.41%, 21.89% and 29.21%. Restricting this analysis to models with high discrimination, good calibration or small mean absolute prediction error reduced the percentile range, but high levels of instability remained.

Conclusions
Widely used cardiovascular disease risk prediction models suffer from high levels of instability induced by sampling variation. Many models will also suffer from overfitting (a closely linked concept), but at acceptable levels of overfitting, there may still be high levels of instability in individual risk. Stability of risk estimates should be a criterion when determining the minimum sample size to develop models.
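The resampling procedure described in Methods can be sketched in code. The following is a minimal illustration, assuming a simulated population, a logistic model fitted with scikit-learn, and an events-per-variable calculation for Nepv10; the actual study used the Clinical Practice Research Datalink, a cardiovascular disease risk model, and 1000 repetitions per sample size, so all variable names and data below are illustrative rather than the authors' implementation.

```python
# Sketch of the instability assessment: repeatedly sample N patients from a
# population, refit the prediction model, score a fixed independent cohort,
# and summarise the 5th-95th percentile range of each patient's predicted risk.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2024)

# Simulated "population" standing in for the derivation data (assumption:
# a binary-outcome logistic model with 10 predictors; the paper used a
# CVD risk model developed on CPRD data).
n_pop, n_pred = 500_000, 10
X_pop = rng.normal(size=(n_pop, n_pred))
true_beta = rng.normal(scale=0.3, size=n_pred)
p_pop = 1 / (1 + np.exp(-(-2.5 + X_pop @ true_beta)))
y_pop = rng.binomial(1, p_pop)

# Independent cohort of patients whose risks are tracked across models
X_test = rng.normal(size=(1_000, n_pred))

def instability(n_sample, n_models=1000):
    """Refit the model on repeated random samples of size n_sample and
    return the 5th-95th percentile range of predicted risk per patient."""
    preds = np.empty((n_models, len(X_test)))
    for m in range(n_models):
        idx = rng.choice(n_pop, size=n_sample, replace=False)
        model = LogisticRegression(max_iter=1000).fit(X_pop[idx], y_pop[idx])
        preds[m] = model.predict_proba(X_test)[:, 1]
    lo, hi = np.percentile(preds, [5, 95], axis=0)
    return hi - lo

# Nepv10: smallest sample size meeting the 10-events-per-predictor rule
event_rate = y_pop.mean()
n_epv10 = int(np.ceil(10 * n_pred / event_rate))

for n in (100_000, 10_000, n_epv10):
    pr_range = instability(n, n_models=200)  # fewer repeats than the paper's 1000, for speed
    print(f"N={n}: median 5-95th percentile range = {np.median(pr_range):.3%}")
```

In this sketch the percentile range widens as the derivation sample shrinks, mirroring the pattern reported in Results, although the exact magnitudes depend on the simulated data rather than on CPRD.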

Highlights

  • Stability of risk estimates from prediction models may be highly dependent on the sample size of the dataset available for model derivation

  • This study found that at sample sizes typically used for developing risk models, there is substantial instability in risk estimates attributable to sampling variability

  • In conclusion, cardiovascular disease (CVD) risk prediction models developed on randomly sampled cohorts of size 10,000 or less suffer from high levels of instability in individual risk predictions


Introduction

Stability of risk estimates from prediction models may be highly dependent on the sample size of the dataset available for model derivation. If the sample size is too small, the most commonly cited issue is that of overfitting, which may result in over-optimistic model performance within the development dataset and poor model performance outside of the development dataset. Another potential issue, of which the implications are less clear, is that small sample sizes could lead to imprecise risk predictions. It is well known that differently defined prediction models may produce different risks for individuals, even if the models perform similarly at the population level (i.e. have similar performance metrics such as discrimination and calibration) [10,11,12,13,14]. This concept largely falls under the reference class problem [14], where a patient could be assigned multiple risk scores depending on which variables are adjusted for in the model, or assigned to different subgroups by stratifying on different variables. The variability in an individual’s risk score induced by using a small sample size is driven purely by statistical uncertainty, distinguishing this from the reference class problem.
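The reference class problem can be made concrete with a small illustration. The sketch below is not from the paper: it uses simulated data and scikit-learn to show that two models adjusting for different variables can assign the same patient noticeably different risks while achieving similar population-level discrimination.

```python
# Illustrative sketch of the reference class problem: the same patient receives
# different risks from two models that adjust for different variable sets,
# even though both models discriminate similarly at the population level.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
n = 50_000
age = rng.normal(60, 10, n)
sbp = rng.normal(130, 15, n) + 0.5 * (age - 60)   # blood pressure correlated with age
smoker = rng.binomial(1, 0.2, n)
lp = -7 + 0.06 * age + 0.02 * sbp + 0.7 * smoker
y = rng.binomial(1, 1 / (1 + np.exp(-lp)))

X_full = np.column_stack([age, sbp, smoker])
X_reduced = np.column_stack([age, smoker])        # omits blood pressure

m_full = LogisticRegression(max_iter=1000).fit(X_full, y)
m_reduced = LogisticRegression(max_iter=1000).fit(X_reduced, y)

# One hypothetical patient: 65 years old, high blood pressure, non-smoker
print("Risk (full model):   ", m_full.predict_proba([[65, 160, 0]])[0, 1])
print("Risk (reduced model):", m_reduced.predict_proba([[65, 0]])[0, 1])
print("AUC full:   ", roc_auc_score(y, m_full.predict_proba(X_full)[:, 1]))
print("AUC reduced:", roc_auc_score(y, m_reduced.predict_proba(X_reduced)[:, 1]))
```

By contrast, the instability studied in this paper arises even when the model specification is held fixed, because refitting the same model on a different random sample of the same size shifts the estimated coefficients and hence the predicted risks.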

