Abstract

Missing values in covariates of regression models are a pervasive problem in empirical research. Popular approaches for analyzing partially observed datasets include complete case analysis (CCA), multiple imputation (MI), and inverse probability weighting (IPW). In the case of missing covariate values, these methods (as typically implemented) are valid under different missingness assumptions. In particular, CCA is valid under missing not at random (MNAR) mechanisms in which missingness in a covariate depends on the value of that covariate, but is conditionally independent of outcome. In this paper, we argue that in some settings such an assumption is more plausible than the missing at random assumption underpinning most implementations of MI and IPW. When the former assumption holds, although CCA gives consistent estimates, it does not make use of all observed information. We therefore propose an augmented CCA approach which makes the same conditional independence assumption for missingness as CCA, but which improves efficiency through specification of an additional model for the probability of missingness, given the fully observed variables. The new method is evaluated using simulations and illustrated through application to data on reported alcohol consumption and blood pressure from the US National Health and Nutrition Examination Survey, in which data are likely MNAR independent of outcome.

Highlights

  • Missing data in covariates of regression models are a common problem in epidemiological and clinical studies

  • We argue that in some settings a missing not at random (MNAR) missingness mechanism under which complete case analysis (CCA) is valid is more plausible than a missing at random (MAR) mechanism, which is required for validity of a conventional multiple imputation (MI) or inverse probability weighting (IPW) analysis

  • We develop an augmented CCA estimation method which can improve upon the efficiency of CCA, through specification of an additional model for the probability of missingness given the fully observed variables

Summary

INTRODUCTION

Missing data in covariates of regression models are a common problem in epidemiological and clinical studies. If data are missing at random (MAR) and the imputation model is correctly specified, MI gives valid inferences, and is generally more efficient than CCA since it uses the observed data from incomplete cases and potentially from auxiliary variables which are not involved in the model of interest. This has led to MI being widely advocated and used in applications (Sterne and others, 2009).
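The abstract's claim that CCA stays consistent when missingness in a covariate depends on that covariate's own value, but not on the outcome, can be checked with a small simulation. The sketch below is illustrative only and is not taken from the paper: the linear outcome model, the logistic missingness mechanisms, the sample size, and the names `ols` and `simulate` are all assumptions made for the example. It fits the regression on the complete cases under two mechanisms, one depending on X alone and one depending on Y.

```python
import numpy as np


def ols(x, y):
    """Return (intercept, slope) from a least-squares regression of y on x."""
    design = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def expit(z):
    """Logistic function, used here for the assumed missingness models."""
    return 1.0 / (1.0 + np.exp(-z))


def simulate(n=200_000, seed=0):
    """Compare complete case estimates under two assumed missingness mechanisms."""
    rng = np.random.default_rng(seed)

    # Assumed model of interest: Y = 1 + 2 X + error (true slope is 2).
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)

    # Mechanism A: X is observed with probability depending on X itself only.
    # This is MNAR, but missingness is independent of Y given X.
    observed_a = rng.random(n) < expit(x)

    # Mechanism B: X is observed with probability depending on the outcome Y.
    observed_b = rng.random(n) < expit(y - 1.0)

    return {
        "full": ols(x, y),
        "mnar_x": ols(x[observed_a], y[observed_a]),
        "dep_y": ols(x[observed_b], y[observed_b]),
    }


if __name__ == "__main__":
    for name, (intercept, slope) in simulate().items():
        print(f"{name:7s} intercept={intercept:.3f} slope={slope:.3f}")
```

Under these assumptions, the complete case slope for mechanism A should sit close to the true value of 2, whereas the slope for mechanism B should be visibly attenuated, because selecting on the outcome distorts the conditional distribution of Y given X among the complete cases.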

IMPROVING UPON THE EFFICIENCY OF CCA
Missingness assumptions
Estimation with full data
Estimation with partially observed X
Implementation
SIMULATIONS
Findings
APPLICATION TO NHANES
DISCUSSION