The impact of missing data on analyses of a time-dependent exposure in a longitudinal cohort: a simulation study

Amalia Karahalios, Katherine J Lee, Julie A Simpson, Dallas R English, John B Carlin, Laura Baglietto

doi:10.1186/1742-7622-10-6

Abstract

Background: Missing data often cause problems in longitudinal cohort studies with repeated follow-up waves. Research in this area has focussed on analyses with missing data in repeated measures of the outcome, from which participants with missing exposure data are typically excluded. We performed a simulation study to compare complete-case analysis with multiple imputation (MI) for dealing with missing data in an analysis of the association between waist circumference, measured at two waves, and the risk of colorectal cancer (a completely observed outcome).

Methods: We generated 1,000 datasets of 41,476 individuals with values of waist circumference at waves 1 and 2 and times to the events of colorectal cancer and death, to resemble the distributions of the data from the Melbourne Collaborative Cohort Study. Three proportions of missing data (15, 30 and 50%) were imposed on waist circumference at wave 2 using three missing data mechanisms: missing completely at random (MCAR), and a realistic and a more extreme covariate-dependent missing at random (MAR) scenario. We assessed the impact of missing data on two epidemiological analyses: 1) the association between change in waist circumference between waves 1 and 2 and the risk of colorectal cancer, adjusted for waist circumference at wave 1; and 2) the association between waist circumference at wave 2 and the risk of colorectal cancer, not adjusted for waist circumference at wave 1.

Results: We observed very little bias for complete-case analysis or MI under all missing data scenarios, and the resulting coverage of interval estimates was near the nominal 95% level. MI showed gains in precision when waist circumference at wave 1 was included as a strong auxiliary variable in the imputation model.

Conclusions: This simulation study, based on data from a longitudinal cohort study, demonstrates that there is little gain in performing MI compared to a complete-case analysis in the presence of up to 50% missing data for the exposure of interest when the data are MCAR, or missing dependent on covariates. MI will result in some gain in precision if a strong auxiliary variable that is not in the analysis model is included in the imputation model.

Highlights

Introduction of missing data: For each of the 1,000 datasets simulated under each “true” hazard ratio (HR) of 1.1 and 1.5, we set 15, 30 and 50% of the waist circumference data at wave 2 to missing.
We report the results of a simulation study that compares the performance of multiple imputation (MI) and complete-case analysis for handling missing data in waist circumference when estimating two associations: (1) change in waist circumference and the incidence of colorectal cancer, and (2) waist circumference at wave 2 and colorectal cancer. Data were simulated using two hazard ratios (HRs), representing a weak and a strong association between change in waist circumference and colorectal cancer, with different amounts of missing data imposed under various missing data mechanisms.
The waist circumference data were set to missing according to three scenarios: missing completely at random (MCAR) and two missing at random (MAR) scenarios in which missingness depends on the covariates. Following Little’s [30] terminology, we refer to the latter as ‘covariate-dependent MAR’.
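As an illustration of the two kinds of mechanism described above, the following is a minimal sketch of how missingness might be imposed on a simulated wave 2 measurement. It is not the authors' procedure: the single standardised age covariate, its coefficient of 0.5, the waist distribution, the reduced sample size and all function names are assumptions made for this example. Under MCAR every value has the same chance of being set to missing. Under covariate-dependent MAR the chance follows a logistic model in a fully observed covariate, with the intercept tuned by bisection so that the overall missing proportion matches the target.

```python
import math
import random


def expit(x):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def calibrate_intercept(linear_predictors, target, lo=-20.0, hi=20.0, iters=50):
    """Bisection search for the intercept whose mean missingness probability is `target`."""
    n = len(linear_predictors)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        mean_p = sum(expit(mid + lp) for lp in linear_predictors) / n
        if mean_p < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def impose_missingness(values, proportion, rng, linear_predictors=None):
    """Return a copy of `values` with roughly `proportion` of entries set to None.

    linear_predictors=None: MCAR, the same probability for every individual.
    Otherwise: covariate-dependent MAR, where the probability depends only on
    fully observed covariates (supplied here as a precomputed linear predictor).
    """
    if linear_predictors is None:
        probs = [proportion] * len(values)
    else:
        intercept = calibrate_intercept(linear_predictors, proportion)
        probs = [expit(intercept + lp) for lp in linear_predictors]
    return [None if rng.random() < p else v for v, p in zip(values, probs)]


def missing_rate(values):
    """Proportion of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


rng = random.Random(2013)
n = 20_000  # assumption: smaller than the study's 41,476 to keep the sketch fast

age_z = [rng.gauss(0.0, 1.0) for _ in range(n)]     # hypothetical standardised age
waist2 = [rng.gauss(90.0, 12.0) for _ in range(n)]  # hypothetical wave 2 waist (cm)
lin_pred = [0.5 * a for a in age_z]                 # hypothetical coefficient

waist2_mcar = impose_missingness(waist2, 0.30, rng)
waist2_mar = impose_missingness(waist2, 0.30, rng, lin_pred)

older = [w for w, a in zip(waist2_mar, age_z) if a > 0]
younger = [w for w, a in zip(waist2_mar, age_z) if a <= 0]
print(missing_rate(waist2_mcar), missing_rate(waist2_mar))
print(missing_rate(older), missing_rate(younger))
```

In this sketch both mechanisms give about the same overall proportion of missing values, but only under the MAR mechanism does the missing rate differ between covariate groups. A more extreme scenario could be mimicked by increasing the assumed coefficient.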

Summary

Introduction

Introduction of missing data: For each of the 1,000 datasets simulated under each “true” HR of 1.1 and 1.5 (i.e. a total of 2,000 datasets), we set 15, 30 and 50% of the waist circumference data at wave 2 to missing.

Missing data often cause problems in longitudinal cohort studies with repeated follow-up waves. Research in this area has focussed on analyses with missing data in repeated measures of the outcome, from which participants with missing exposure data are typically excluded. We performed a simulation study to compare complete-case analysis with multiple imputation (MI) for dealing with missing data in an analysis of the association between waist circumference, measured at two waves, and the risk of colorectal cancer (a completely observed outcome).

An increasing number of cohort studies are conducting repeated waves of follow-up in order to update information on their participants. This collection of data allows researchers to assess the association between change in an exposure variable, measured prospectively, and the risk of a given outcome variable. Multiple imputation (MI) is an alternative method for handling missing data, which has become increasingly accessible to researchers in a number of statistical software packages (e.g. SAS [7] and Stata [8]).
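In MI the analysis model is fitted separately in each imputed dataset and the results are then combined, conventionally with Rubin's rules. The sketch below shows that pooling step for a log hazard ratio. The five estimates and standard errors are invented for illustration and are not results from this study, and `pool_rubin` is a hypothetical helper name. The interval uses a normal quantile, which is an approximation that is reasonable only when the pooled degrees of freedom are large; otherwise a t reference distribution with those degrees of freedom would be used.

```python
import math
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances with Rubin's rules.

    Returns the pooled estimate, its total variance and the degrees of freedom.
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    if between > 0:
        df = (m - 1) * (1.0 + within / ((1.0 + 1.0 / m) * between)) ** 2
    else:
        df = math.inf              # no between-imputation variability
    return pooled, total, df


# Hypothetical log hazard ratios and standard errors from m = 5 imputed datasets.
log_hrs = [0.40, 0.42, 0.38, 0.41, 0.39]
std_errs = [0.10, 0.11, 0.10, 0.10, 0.11]

pooled, total_var, df = pool_rubin(log_hrs, [se ** 2 for se in std_errs])

z = NormalDist().inv_cdf(0.975)    # normal approximation, adequate when df is large
half_width = z * math.sqrt(total_var)
hr = math.exp(pooled)
ci = (math.exp(pooled - half_width), math.exp(pooled + half_width))
print(hr, ci, df)
```

The total variance is never smaller than the average within-imputation variance, because the between-imputation term carries the extra uncertainty due to the missing data.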
