Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study

Douglas G Altman, Andrea Marshall, Patrick Royston, Roger L Holder

doi:10.1186/1471-2288-10-7

Abstract

Background: There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. Therefore a simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model.

Methods: Datasets were generated to resemble the skewed distributions seen in a motivating breast cancer example. Multivariate missing data were imposed on four covariates using four different mechanisms: missing completely at random (MCAR), missing at random (MAR), missing not at random (MNAR) and a combination of all three mechanisms. Five amounts of incomplete cases from 5% to 75% were considered. Complete case analysis (CC), single imputation (SI) and five multiple imputation (MI) techniques available within the R statistical software were investigated: a) data augmentation (DA) approach assuming a multivariate normal distribution, b) DA assuming a general location model, c) regression switching imputation, d) regression switching with predictive mean matching (MICE-PMM) and e) flexible additive imputation models. A Cox proportional hazards model was fitted and appropriate estimates for the regression coefficients and model performance measures were obtained.

Results: Performing a CC analysis produced unbiased regression estimates, but inflated standard errors, which affected the significance of the covariates in the model with 25% or more missingness. Using SI underestimated the variability, resulting in poor coverage even with 10% missingness. Of the MI approaches, applying MICE-PMM produced, in general, the least biased estimates and better coverage for the incomplete covariates and better model performance for all mechanisms. However, this MI approach still produced biased regression coefficient estimates for the incomplete skewed continuous covariates when 50% or more cases had missing data imposed with an MCAR, MAR or combined mechanism. When the missingness depended on the incomplete covariates, i.e. MNAR, estimates were biased with more than 10% incomplete cases for all MI approaches.

Conclusion: The results from this simulation study suggest that performing MICE-PMM may be the preferred MI approach provided that less than 50% of the cases have missing data and the missing data are not MNAR.
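
To make the distinction between the MCAR, MAR and MNAR mechanisms concrete, the following is a minimal Python sketch of how missingness might be imposed on a single covariate. It is an illustration only and not the procedure used in the study (which worked in R, with four incomplete covariates and skewed distributions). The function name `impose_missing`, the covariate names `x1` and `x2`, the logistic form of the missingness model and the `rate` and `slope` parameters are all assumptions made for this example.

```python
import math
import random


def logistic(z):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def impose_missing(rows, mechanism, rate=0.25, slope=1.0, seed=0):
    """Return a copy of `rows` with 'x2' set to None under one mechanism.

    rows      : list of dicts with numeric keys 'x1' (never made missing)
                and 'x2' (the covariate that may be made missing).
    mechanism : 'MCAR', 'MAR' or 'MNAR'.
    rate      : exact missingness probability for MCAR; for MAR and MNAR it
                only fixes the intercept, so the realised proportion of
                missing values is approximate.
    slope     : hypothetical strength of dependence for MAR and MNAR.
    """
    rng = random.Random(seed)
    intercept = math.log(rate / (1.0 - rate))
    out = []
    for row in rows:
        if mechanism == "MCAR":
            # Missingness is unrelated to any data values.
            p = rate
        elif mechanism == "MAR":
            # Missingness depends only on the fully observed covariate x1.
            p = logistic(intercept + slope * row["x1"])
        elif mechanism == "MNAR":
            # Missingness depends on the value that would go missing.
            p = logistic(intercept + slope * row["x2"])
        else:
            raise ValueError("mechanism must be 'MCAR', 'MAR' or 'MNAR'")
        new_row = dict(row)
        if rng.random() < p:
            new_row["x2"] = None
        out.append(new_row)
    return out


# Illustrative data: two independent standard normal covariates
# (an assumption; the study generated skewed covariates).
data_rng = random.Random(1)
rows = [{"x1": data_rng.gauss(0, 1), "x2": data_rng.gauss(0, 1)}
        for _ in range(4000)]
mnar_rows = impose_missing(rows, "MNAR", rate=0.25, seed=2)
```

Under the MNAR setting in this sketch, larger values of `x2` are more likely to be removed, so the observed values of `x2` are no longer representative of the full sample. This is the situation in which the abstract reports biased estimates for all MI approaches.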

Highlights

There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies
The regression coefficient estimates after performing a complete case (CC) analysis remained within the limits for unproblematic estimates [12] of ± 0.5SE for all levels of missingness and were generally closer to the true value for all covariates than after using single imputation (SI) or multiple imputation (MI) (Figure 2)
For SI and most MI approaches, the regression coefficient estimates were more than 0.5SE away from the true value for the two incomplete continuous covariates (X2 and X3) and X4, the covariate highly correlated with X3, when 25% or more of the cases had at least one covariate missing
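
The ± 0.5SE limit quoted in the highlights above can be read as a threshold on the standardised bias of an estimate. The Python sketch below shows one way such a check might be computed across simulated datasets. It assumes that the SE is the empirical standard error, i.e. the standard deviation of the estimates over the simulation replicates; the source does not state which SE was used, and the function names and example numbers are hypothetical.

```python
import statistics


def standardised_bias(estimates, true_value):
    """Bias of the mean estimate, expressed in units of the empirical SE.

    The empirical SE is taken to be the sample standard deviation of the
    estimates across simulation replicates (an assumption of this sketch).
    """
    mean_estimate = statistics.mean(estimates)
    empirical_se = statistics.stdev(estimates)
    return (mean_estimate - true_value) / empirical_se


def is_unproblematic(estimates, true_value, limit=0.5):
    """True if the mean estimate lies within `limit` SEs of the true value."""
    return abs(standardised_bias(estimates, true_value)) <= limit


# Hypothetical coefficient estimates from five simulated datasets,
# with a true coefficient of 0.5.
centred = [0.48, 0.52, 0.50, 0.47, 0.53]   # scattered around the truth
shifted = [0.60, 0.62, 0.58, 0.61, 0.59]   # consistently too large

centred_ok = is_unproblematic(centred, 0.5)
shifted_ok = is_unproblematic(shifted, 0.5)
```

In this example the first set of estimates passes the check and the second fails it, because its mean lies several empirical SEs above the true value.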

Summary

Introduction

There is no consensus on the most appropriate approach to handle missing covariate data within prognostic modelling studies. A simulation study was performed to assess the effects of different missing data techniques on the performance of a prognostic model. Missing covariate data complicate the analysis but occur often [1]. A review of published prognostic studies [1] found that, in the 39 studies where this information could be obtained, on average 13% of cases had incomplete data (range 0-60%). Using only the cases with complete covariate data, i.e. performing a complete case (CC) analysis, loses information and efficiency, and may lead to biased regression coefficients if the missingness is related to the outcome [2,3]. Alternative approaches generally require problem-specific programs to be written, and these may not be readily available.

Objectives

Methods

Results

Conclusion