Abstract

Objective

The development of clinical prediction models is often impeded by missing values in the predictors. Various methods for imputing missing values before modelling have been proposed: some are variants of multiple imputation by chained equations, while others rely on single imputation. These methods may incorporate flexible modelling or machine learning algorithms, and user-friendly software packages are available for some of them. The aim of this study was to investigate by simulation whether some of these methods consistently outperform others on performance measures of clinical prediction models.

Study Design and Setting

We simulated development and validation cohorts by mimicking the observed distributions of predictors and the outcome variable of a real data set. In the development cohorts, missing predictor values were created in 36 scenarios defined by the missingness mechanism and the proportion of non-complete cases. We applied three imputation algorithms available in R: mice, aregImpute and missForest. These algorithms differ in their use of linear models, flexible models or random forests, in how they sample from the predictive posterior distribution, and in whether they generate a single or multiple imputed data sets. For multiple imputation we also investigated the impact of the number of imputations. Logistic regression models were fitted to the simulated development cohorts before (full data analysis) and after missing value generation (complete case analysis), and to the imputed data. Prognostic model performance was measured by the scaled Brier score, c-statistic, calibration intercept and slope, and mean absolute prediction error, all evaluated in validation cohorts without missing values. The performance of full data analysis was considered ideal.

Results

None of the imputation methods achieved the predictive accuracy that would have been obtained with no missingness. In general, complete case analysis yielded the worst performance, and the deviation from ideal performance increased with increasing percentage of missingness and decreasing sample size. Across all scenarios and performance measures, aregImpute and mice, both with 100 imputations, resulted in the highest predictive accuracy. Surprisingly, aregImpute outperformed full data analysis by achieving calibration slopes very close to 1 across all scenarios and outcome models. The gain in mice's performance with 100 compared to 5 imputations was only marginal. The differences between the imputation methods decreased with increasing sample size and decreasing proportion of non-complete cases.

Conclusion

In our simulation study, model calibration was more affected by the choice of imputation method than model discrimination. While differences in model performance after imputation were generally small, multiple imputation methods such as mice and aregImpute, which can handle linear or nonlinear associations between predictors and outcome, are an attractive and reliable choice in most situations.
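Two of the performance measures named above can be computed directly from out-of-sample predicted probabilities: the scaled Brier score (1 minus the ratio of the model's Brier score to that of a no-information model predicting the outcome prevalence), and the calibration intercept and slope (from a logistic recalibration model of the observed outcome on the linear predictor, where a slope of 1 and intercept of 0 indicate perfect calibration). The sketch below is a minimal Python analogue, not the authors' R implementation; the function names are illustrative, and estimating intercept and slope jointly in one recalibration model is one common convention among several.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def scaled_brier(y, p):
    """Scaled Brier score: 1 - Brier(model) / Brier(no-information model)."""
    brier = np.mean((p - y) ** 2)
    p_bar = np.mean(y)                 # outcome prevalence
    brier_max = p_bar * (1 - p_bar)    # Brier score when always predicting p_bar
    return 1 - brier / brier_max

def calibration_slope_intercept(y, p):
    """Calibration slope and intercept from a logistic recalibration model.

    Regresses the observed binary outcome on the linear predictor
    (logit of the predicted probabilities). Slope 1, intercept 0
    indicate perfect calibration.
    """
    lp = np.log(p / (1 - p)).reshape(-1, 1)   # linear predictor
    # Very large C makes the fit effectively unpenalized.
    fit = LogisticRegression(C=1e6).fit(lp, y)
    return fit.coef_[0][0], fit.intercept_[0]
```

For a well-calibrated model the recalibration fit recovers slope ≈ 1 and intercept ≈ 0; slopes below 1 indicate the overfitting-style miscalibration that the abstract reports is sensitive to the choice of imputation method.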

