Abstract

Background: Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases.

Methods: Simulated datasets (n = 1000) drawn from a synthetic population were used to explore information recovery from multiple imputation in estimating the coefficient of a binary exposure variable when various proportions of data (10-90%) were set missing at random in a highly skewed continuous covariate or in the binary exposure. Imputation was performed using multivariate normal imputation (MVNI), with a simple or zero-skewness log transformation to manage non-normality. Bias, precision, mean-squared error and coverage for a set of regression parameter estimates were compared between multiple imputation and complete case analyses.

Results: For missingness in the continuous covariate, multiple imputation produced less bias and greater precision for the effect of the binary exposure variable, compared with complete case analysis, with larger gains in precision with more missing data. However, even with only moderate missingness, large bias and substantial under-coverage were apparent in estimating the continuous covariate’s effect when skewness was not adequately addressed. For missingness in the binary covariate, all estimates had negligible bias but gains in precision from multiple imputation were minimal, particularly for the coefficient of the binary exposure.

Conclusions: Although multiple imputation can be useful if covariates required for confounding adjustment are missing, benefits are likely to be minimal when data are missing in the exposure variable of interest. Furthermore, when there are large amounts of missingness, multiple imputation can become unreliable and introduce bias not present in a complete case analysis if the imputation model is not appropriate. Epidemiologists dealing with missing data should keep in mind the potential limitations as well as the potential benefits of multiple imputation. Further work is needed to provide clearer guidelines on effective application of this method.
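
The four performance measures named in the Methods (bias, precision, mean-squared error and coverage) can be illustrated with a short sketch. The abstract does not give the exact formulas used, so the definitions below are the conventional ones for simulation studies and are an assumption, as are the function name `performance` and the use of a normal-approximation 95% interval.

```python
import numpy as np

def performance(estimates, std_errors, true_value, z=1.96):
    """Summarise simulation results for one regression coefficient.

    estimates, std_errors: one entry per simulated dataset (hypothetical inputs).
    true_value: the coefficient value in the synthetic population.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)

    # Bias: average estimate minus the true value.
    bias = estimates.mean() - true_value
    # Precision, summarised here by the empirical standard error of the estimates.
    empirical_se = estimates.std(ddof=1)
    # Mean-squared error: average squared distance from the true value.
    mse = np.mean((estimates - true_value) ** 2)
    # Coverage: share of nominal 95% intervals that contain the true value.
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    coverage = np.mean((lower <= true_value) & (true_value <= upper))

    return {"bias": bias, "empirical_se": empirical_se,
            "mse": mse, "coverage": coverage}
```

Calling `performance` once on the multiple imputation estimates and once on the complete case estimates, for the same coefficient, gives the kind of side-by-side comparison the study describes.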

Highlights

  • Statistical analysis of epidemiological data is often hindered by missing data

  • We report the results of a simulation study in which we generate missingness assuming various forms of “missing at random” (MAR), so that multiple imputation would be valid if performed under a correct model, and compare inferences for regression parameters under various missing data scenarios between multiple imputation and complete case analysis

  • The results from this study demonstrate that, although it may be important to use multiple imputation to recover information when there are missing data in covariates required for adjustment, multiple imputation has substantially less value when there are missing data in the exposure of interest
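
Generating missingness under MAR, as the highlights describe, can be sketched as follows. The source does not state the missingness model, so everything here is an assumption for illustration: the hypothetical function `make_mar_missing`, the logistic form, the `effect` parameter, and the choice to let missingness depend only on a fully observed binary exposure.

```python
import numpy as np

def make_mar_missing(covariate, exposure, target_prop, rng, effect=1.0):
    """Return a copy of `covariate` with values set to NaN under a MAR mechanism.

    The probability of being missing depends only on `exposure`, which stays
    fully observed, so missingness is unrelated to the covariate's own value
    once exposure is taken into account.
    """
    covariate = np.asarray(covariate, dtype=float)
    exposure = np.asarray(exposure, dtype=float)

    def mean_prob(intercept):
        # Average missingness probability under a logistic model.
        return np.mean(1.0 / (1.0 + np.exp(-(intercept + effect * exposure))))

    # Bisection on the intercept so the expected proportion missing
    # matches target_prop (for example 0.1 up to 0.9).
    lo, hi = -20.0, 20.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if mean_prob(mid) < target_prop:
            lo = mid
        else:
            hi = mid
    intercept = (lo + hi) / 2.0

    prob = 1.0 / (1.0 + np.exp(-(intercept + effect * exposure)))
    out = covariate.copy()
    out[rng.random(len(out)) < prob] = np.nan
    return out
```

A complete case analysis would then drop every row where the returned array is NaN, whereas multiple imputation would keep those rows and impute the missing covariate values.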


Summary

Introduction

Statistical analysis of epidemiological data is often hindered by missing data. Multiple imputation is a two-stage process whereby missing values are imputed multiple times from a statistical model based on the available data and used in analyses that combine results across the multiply imputed datasets. A common misconception with multiple imputation arises from focussing on the mechanics of filling in missing values, as if imputation recovers a fully observed sample, when the value of multiple imputation (if any) relates to whether it recovers information about (population) parameters of interest. An alternative to multivariate normal imputation (MVNI) is fully conditional specification (FCS), where separate regression models are fitted for each variable with missingness, conditional on other variables in the imputation model [5,6]. Both approaches assume values are “missing at random” (MAR), i.e. the missingness depends on observed values only, and rely on parametric assumptions, in particular that continuous variables are normally distributed (at least conditionally under FCS). Multiple imputation is becoming increasingly popular for handling missing data. However, it is often implemented without adequate consideration of whether it offers any advantage over complete case analysis for the research question of interest, or whether potential gains may be offset by bias from a poorly fitting imputation model, particularly as the amount of missing data increases.
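
The second stage of multiple imputation, combining results across the imputed datasets, is conventionally done with Rubin's rules. The text here does not say which pooling rule the study used, so the sketch below is an assumption, and `pool_rubin` is a hypothetical helper name. It takes one coefficient estimate and its squared standard error from each imputed dataset.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimate from each imputed dataset.
    variances: the squared standard error from each imputed dataset.
    Requires at least two imputed datasets.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)

    pooled_estimate = q.mean()      # average of the per-dataset estimates
    within = u.mean()               # average within-imputation variance
    between = q.var(ddof=1)         # variance of the estimates across imputations
    total_variance = within + (1.0 + 1.0 / m) * between

    return pooled_estimate, total_variance
```

The between-imputation term is what carries the uncertainty due to the missing values. Treating a single filled-in dataset as if it were fully observed omits that term, which is the misconception the paragraph above warns against.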

