Machine learning methods are increasingly being used to predict clinical outcomes. Optimism is the difference in model performance between the derivation and validation samples. The term "data hungriness" refers to the sample size a modelling technique needs to generate a prediction model with minimal optimism. Our objective was to compare the relative data hungriness of different statistical and machine learning methods when assessed using calibration. We used Monte Carlo simulations to assess the effect of the number of events per variable (EPV) on the optimism in calibration of six learning methods: unpenalized logistic regression, ridge regression, lasso regression, bagged classification trees, random forests, and stochastic gradient boosting machines using trees as the base learners. We performed simulations in two large cardiovascular datasets, each comprising an independent derivation and validation sample: patients hospitalized with acute myocardial infarction and patients hospitalized with heart failure. We used six data-generating processes, each based on one of the six learning methods. We varied the derivation sample size so that the number of EPV ranged from 10 to 200 in increments of 10. We applied the six prediction methods in each simulated derivation sample and evaluated calibration in the corresponding simulated validation sample using the integrated calibration index (ICI), the calibration intercept, and the calibration slope. We also examined Nagelkerke's R², the scaled Brier score, and the c-statistic. Across all 12 scenarios (2 diseases × 6 data-generating processes), penalized logistic regression (ridge and lasso) displayed very low optimism even when the number of EPV was very low. Random forests and bagged classification trees tended to be the most data hungry and displayed the greatest optimism. When assessed using calibration, penalized logistic regression was substantially less data hungry than methods from the machine learning literature.
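The three calibration measures named in the abstract are standard. As an illustration only (not the authors' code), a minimal Python sketch of how they might be computed from binary outcomes and predicted probabilities is shown below; it assumes the statsmodels library and substitutes a lowess smoother for the loess-based calibration curve often used in this literature, and the function name `calibration_metrics` is hypothetical.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.nonparametric.smoothers_lowess import lowess

def calibration_metrics(y, p):
    """Illustrative calibration intercept, slope, and integrated
    calibration index (ICI) for binary outcomes y and predicted
    probabilities p (hypothetical sketch, not the study's code)."""
    # Guard against 0/1 probabilities before taking the logit.
    p = np.clip(p, 1e-10, 1 - 1e-10)
    lp = np.log(p / (1 - p))  # linear predictor on the logit scale

    # Calibration slope: coefficient from a logistic regression of the
    # observed outcome on the linear predictor (1.0 = perfect).
    slope_fit = sm.GLM(y, sm.add_constant(lp),
                       family=sm.families.Binomial()).fit()
    slope = slope_fit.params[1]

    # Calibration intercept: intercept-only logistic model with the
    # linear predictor entered as an offset (0.0 = perfect).
    int_fit = sm.GLM(y, np.ones_like(lp),
                     family=sm.families.Binomial(), offset=lp).fit()
    intercept = int_fit.params[0]

    # ICI: mean absolute difference between predicted probabilities and
    # a locally smoothed estimate of the observed event rate.
    smoothed = lowess(y, p, frac=0.75, return_sorted=False)
    ici = np.mean(np.abs(p - smoothed))

    return intercept, slope, ici
```

Under this setup, the optimism reported in the abstract would correspond, for each metric, to the value obtained in the derivation sample minus the value obtained in the independent validation sample.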