Abstract

Missing covariate data are a feature common to most epidemiologic and clinical studies. It is now widely recognized that the routinely used complete-case analysis, which excludes subjects with any missing variable, can (i) yield highly inefficient estimators in regression analysis, and (ii) be severely biased when the complete cases are not a completely random sample of the full data (that is, when the data are not missing completely at random). In the simplest and most familiar setting where the observed data arise from simple random sampling (ie, independent and identically distributed data), weighted estimating equations provide a powerful framework to perform regression analysis, while appropriately accounting for covariate data that are missing at random but not necessarily missing completely at random. In this issue of EPIDEMIOLOGY, Moore et al1 use 2-stage weighted estimating equations, or re-weighted estimating equations, to perform logistic regression of Y, the indicator of trying to lose weight in the last 12 months, on Z, the indicator of high cholesterol, which is missing by happenstance, controlling for a subset of fully observed covariates X collected in the third National Health and Nutrition Examination Survey (NHANES III). NHANES III used stratified, multistage sampling and thus does not follow a simple random sampling design. The authors use all sampled participants with complete observations, but propose to jointly account for individuals' differential propensity both to be selected into the survey sample and to have an observed cholesterol indicator. They note that this can be achieved by multiplying each participant's contribution to the complete-case logistic score equation by the product of the (known) inverse probability of being selected and the (unknown) inverse probability of having an observed cholesterol indicator (see equation (3) of their paper).
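For concreteness, the doubly weighted complete-case score equation described above can be written in the following form. This is a sketch in notation assumed here, not a reproduction of equation (3) of Moore et al: $p_i$ stands for the known probability that participant $i$ is selected into the survey sample, $R_i$ equals 1 when the cholesterol indicator is observed and 0 otherwise, $\pi_i$ stands for the unknown probability that $R_i = 1$, and $\beta = (\beta_0, \beta_1, \beta_2)$ are the coefficients of the logistic regression of Y on Z and X.

```latex
\sum_{i \in \text{sample}} \frac{R_i}{p_i \, \pi_i}
\begin{pmatrix} 1 \\ Z_i \\ X_i \end{pmatrix}
\left\{ Y_i - \operatorname{expit}\!\left(\beta_0 + \beta_1 Z_i + \beta_2^{\top} X_i\right) \right\} = 0,
\qquad \operatorname{expit}(t) = \frac{1}{1 + e^{-t}}.
```

Only participants with $R_i = 1$ contribute to the sum, and each is up-weighted by $1/(p_i \pi_i)$ to stand in for similar individuals who were either not sampled or sampled with a missing cholesterol indicator.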
The missingness mechanism, which stochastically determines which of the sampled participants have complete data, is unknown to the analyst and therefore must be estimated. Fortunately, missingness at random implies that an individual's probability of having an observed cholesterol indicator (R = 1) does not depend on whether or not she has high cholesterol, but may depend on other fully observed variables Y, X, and S. Here, S includes covariates that might not be of primary scientific interest, but are necessary, together with Y and X, to explain any association between R and Z. Under this assumption, one can therefore proceed by substituting the unknown weights with estimated weights,2 constructed with an estimate $\hat{\pi} = \pi(X, Y, S; \hat{\alpha})$ of $\Pr(R = 1 \mid X, Y, S)$, where $\hat{\alpha}$ denotes the observed-data maximum likelihood estimator of the coefficients of the logistic regression of R on (X, Y, S). Moore et al essentially adapt this approach to their 2-stage procedure, but require the stronger assumption that S can be dropped from $\pi$. In this commentary, special consideration is given to issues of statistical efficiency and modeling robustness. First, even in the absence of model mis-specification, inverse probability weighting can be highly inefficient. This is because, even when (as suggested by Moore et al) one fits a highly parameterized model for $\pi$ (of course within limits of the data), inverse-probability weighting still fails to make optimal use of the data (Y, X, S) observed among all individuals, including those missing Z. Second, an incorrect working
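The estimated-weight procedure described in this paragraph can be illustrated with a minimal simulation sketch. It is not the implementation of Moore et al, and everything in it is an assumption made for illustration: the simulated population and its coefficients, the hypothetical helper `fit_weighted_logistic` (a plain Newton-Raphson solver), the choice to drop S so that the missingness model depends on (X, Y) only, and the choice to fit that model unweighted among sampled participants.

```python
import numpy as np


def expit(t):
    """Inverse logit function."""
    return 1.0 / (1.0 + np.exp(-t))


def fit_weighted_logistic(design, outcome, weights, max_iter=100, tol=1e-10):
    """Solve sum_i w_i * x_i * (y_i - expit(x_i' beta)) = 0 by Newton-Raphson.

    Hypothetical helper written for this sketch, not a library routine.
    """
    beta = np.zeros(design.shape[1])
    for _ in range(max_iter):
        mu = expit(design @ beta)
        score = design.T @ (weights * (outcome - mu))
        info = (design * (weights * mu * (1.0 - mu))[:, None]).T @ design
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


rng = np.random.default_rng(2024)
n_pop = 200_000
true_beta = np.array([-1.0, 1.0, 0.5])  # assumed (intercept, Z, X) coefficients

# Assumed finite population: X continuous, Z and Y binary.
x = rng.normal(size=n_pop)
z = rng.binomial(1, expit(-0.5 + 0.5 * x))
y = rng.binomial(1, expit(true_beta[0] + true_beta[1] * z + true_beta[2] * x))

# Known, unequal selection probabilities (a stand-in for a complex survey design).
p_sel = np.where(y == 1, 0.3, 0.1)
sampled = rng.random(n_pop) < p_sel
x, z, y, p_sel = x[sampled], z[sampled], y[sampled], p_sel[sampled]

# Missing at random: observing Z depends on (X, Y) but not on Z itself.
r = rng.binomial(1, expit(0.5 + 1.0 * y - 0.5 * x))

# Step 1: estimate pi = Pr(R = 1 | X, Y) by logistic regression of R on (X, Y).
ones = np.ones_like(x)
miss_design = np.column_stack([ones, x, y])
alpha_hat = fit_weighted_logistic(miss_design, r, ones)
pi_hat = expit(miss_design @ alpha_hat)

# Step 2: complete-case logistic regression of Y on (Z, X), weighted by
# 1 / (selection probability * estimated probability of observing Z).
cc = r == 1
weights = 1.0 / (p_sel[cc] * pi_hat[cc])
out_design = np.column_stack([ones[cc], z[cc], x[cc]])
beta_hat = fit_weighted_logistic(out_design, y[cc], weights)

print("IPW estimate of (intercept, Z, X):", np.round(beta_hat, 3))
print("Coefficients used in the simulation:", true_beta)
```

In this sketch the point estimates should land near the simulated coefficients because the missingness model is correctly specified by construction. The sketch does not compute standard errors, which would need to reflect both the survey design and the estimation of the weights.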
