Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study.

Andrea Marshall, Douglas G Altman, Roger L Holder

doi:10.1186/1471-2288-10-112

Open Access

https://doi.org/10.1186/1471-2288-10-112

Abstract

Background: The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. A resampling study was performed to investigate the effects of different missing data methods on the performance of a prognostic model.

Methods: Observed data for 1000 cases were sampled with replacement from a large complete dataset of 7507 patients to obtain 500 replications. Five levels of missingness (ranging from 5% to 75%) were imposed on three covariates using a missing at random (MAR) mechanism. Five missing data methods were applied: a) complete case analysis (CC), b) single imputation using regression switching with predictive mean matching (SI), c) multiple imputation using regression switching imputation, d) multiple imputation using regression switching with predictive mean matching (MICE-PMM), and e) multiple imputation using flexible additive imputation models. A Cox proportional hazards model was fitted to each dataset and estimates for the regression coefficients and model performance measures obtained.

Results: CC produced biased regression coefficient estimates and inflated standard errors (SEs) with 25% or more missingness. The underestimated SE after SI resulted in poor coverage with 25% or more missingness. Of the MI approaches investigated, MI using MICE-PMM produced the least biased estimates and better model performance measures. However, this MI approach still produced biased regression coefficient estimates with 75% missingness.

Conclusions: Very few differences were seen between the results from all missing data approaches with 5% missingness. However, performing MI using MICE-PMM may be the preferred missing data approach for handling between 10% and 50% MAR missingness.
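The resampling design described in the abstract (draw 1000 cases with replacement from the complete data, then impose MAR missingness) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the synthetic data, the covariate names `age` and `size`, the logistic missingness model, and its `slope`. The study imposed missingness on three covariates; the sketch uses one to keep the idea visible.

```python
import math
import random


def make_complete_data(n, rng):
    """Hypothetical stand-in for the complete dataset (synthetic values)."""
    return [{"age": rng.gauss(60, 10), "size": rng.gauss(25, 8)} for _ in range(n)]


def resample(data, n_cases, rng):
    """Draw n_cases records with replacement; copy so the source stays intact."""
    return [dict(rng.choice(data)) for _ in range(n_cases)]


def missing_prob(age, intercept, slope):
    """Logistic model: the chance that 'size' is missing depends only on 'age'."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * (age - 60))))


def impose_mar(sample, target, rng, slope=0.05):
    """Blank out 'size' under MAR so that about `target` of cases are affected.

    Missingness depends on the fully observed 'age', never on 'size' itself,
    which is what makes the mechanism MAR rather than MNAR.
    """
    # Bisection on the intercept so the mean missingness probability hits target.
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2
        mean_p = sum(missing_prob(r["age"], mid, slope) for r in sample) / len(sample)
        if mean_p < target:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2
    for r in sample:
        if rng.random() < missing_prob(r["age"], intercept, slope):
            r["size"] = None
    return sample


rng = random.Random(0)
complete = make_complete_data(7507, rng)
sample = impose_mar(resample(complete, 1000, rng), target=0.25, rng=rng)
frac_missing = sum(r["size"] is None for r in sample) / len(sample)
```

In a full replication this step would be repeated 500 times per missingness level, with each missing data method and the Cox model applied to every resulting dataset.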

Highlights

The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined
The average percentage of available covariate data items for the 1000 cases in each dataset remained relatively high for all amounts of missingness imposed, ranging from 99% with 5% missingness to 86% when 75% of cases had one or more missing data items
A complete case (CC) analysis produced very unstable regression coefficient estimates from the Cox proportional hazards model when there were large amounts of missingness, especially for the binary pre-operative RT covariate, which had a 95:5 split in the data

Summary

Introduction

The appropriate handling of missing covariate data in prognostic modelling studies is yet to be conclusively determined. Many approaches for handling missing covariates when fitting a Cox proportional hazards model have been proposed, such as likelihood-based techniques. MICE was found to produce similar results to MI using data augmentation and assuming a joint multivariate normal model or a general location model [5]. It is not clear whether MICE with PMM would remain beneficial in other populations, where the data may be closer to the underlying assumptions of the imputation methods.

Methods

Results

Conclusion