Sturmer et al. Respond to "Propensity Score Methods in Epidemiology"

T Sturmer, J Avorn, K J Rothman, S Schneeweiss, R J Glynn

doi:10.1093/aje/kwm068

Abstract

We appreciate the thoughtful commentary of Oakes and Church (1) on our paper (2) and their conclusion that propensity score calibration may be helpful when some confounders are unmeasured. We agree that usual applications of propensity scores only control for confounding by “observable selection,” but we see much closer links between instrumental variables (3–5) and propensity score calibration than those described by Oakes and Church. Indeed, the gold-standard propensity score estimated in the validation study hopefully comes closer to the true, but unknown, propensity of treatment than the error-prone one does and thus performs as an approximate instrument under assumptions similar to surrogacy (6,7).

Propensity score calibration is no panacea for missing data on confounders; there is no substitute for having good data on important confounders for every subject. Propensity score calibration was developed in a pharmacoepidemiologic analysis of claims data that lack information on a variety of confounders (8). Using data from a validation study, we obtained an estimate of the association between nonsteroidal anti-inflammatory drugs and short-term all-cause mortality in older adults that was more plausible than the naive estimate (9).

We now briefly respond to the 6 issues raised by Oakes and Church (1).

The low precision of the estimation with a cohort of N=1,000 is due to the very low expected number of outcomes (N=10). We would not call this low precision an anomaly because the median OR is still unbiased.

The scope of our simulations does not yet allow us to propose a sharp criterion to decide whether the surrogacy assumption is valid. The assessment of surrogacy depends on having outcome data in the validation study. With such data available, other methods, including imputation, are promising alternatives to propensity score calibration (10). Unfortunately, validation studies do not always contain outcome information. In such settings, propensity score calibration might be the best possibility for bias reduction. Important violations of surrogacy could be explored by considering factors measured in the validation study individually, in combination with literature estimates of their independent effect on the outcome (11).

We did not address how closely the validation sample needs to represent the main study, and there clearly are dangers in estimating the measurement error model in an external validation study (6,9). This will be an important judgment that investigators will have to make when applying propensity score calibration.

Should the estimation of the measurement error model be included in the bootstrap method? The usual implementation of regression calibration takes the estimation of the measurement error model into account (12) but provided variance estimates that were too small compared with the empirical variance over simulations. We therefore used conditional mean imputation, matching, and the bootstrap for matched pairs to implement propensity score calibration, resulting in variance estimates that were close to the empirical ones (2).

Because we match subjects, exposed subjects for whom no unexposed match can be found, owing to non-overlap, are automatically excluded from the analysis. Non-overlap will tend to increase with propensity score calibration, because the gold-standard propensity score is at least as strongly associated with the exposure as the error-prone one. Investigators should carefully assess exposed subjects excluded from estimation, because the estimate might not be generalizable to them (13).

Design aspects of validation studies need more attention. In pharmacoepidemiologic research based on routinely collected data, the scope of covariates that one would like to control, beyond those already contained in the administrative data, might include, e.g., smoking, body mass index, physical activity, activities of daily living, and cognitive function (9). But certainly some potential confounders and their measurements will always be elusive.
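
The three steps named above (conditional mean imputation, matching, and the bootstrap for matched pairs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names are hypothetical. The calibration model is assumed to be linear in the exposure and the error-prone propensity score, the matching is assumed to be greedy 1:1 nearest-neighbor matching with an arbitrary caliper of 0.05, and a risk difference stands in for the odds ratio only because it is defined for every resample. The bootstrap here treats the calibrated scores as fixed and resamples matched pairs only; whether the measurement error model should be re-estimated inside the bootstrap is the question raised in the text and is not settled by this sketch.

```python
import random


def fit_calibration(validation):
    """Least-squares fit of the gold-standard PS on exposure and error-prone PS.

    validation: list of (exposure, ps_error_prone, ps_gold) tuples.
    Returns (g0, g1, g2) for E[ps_gold] = g0 + g1*exposure + g2*ps_error_prone.
    Assumes exposure and ps_error_prone both vary (non-singular design).
    """
    rows = [(1.0, float(a), float(x)) for a, x, _ in validation]
    y = [float(g) for _, _, g in validation]
    # Normal equations (X'X) b = X'y, stored as an augmented 3x4 matrix.
    m = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(m[k][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for k in range(3):
            if k != col:
                factor = m[k][col] / m[col][col]
                m[k] = [u - factor * v for u, v in zip(m[k], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def calibrate(main, coefs):
    """Conditional mean imputation: replace the error-prone PS in the main
    study by its predicted gold-standard value.

    main: list of (exposure, ps_error_prone, outcome) tuples.
    """
    g0, g1, g2 = coefs
    return [(a, g0 + g1 * a + g2 * x, out) for a, x, out in main]


def match_pairs(subjects, caliper=0.05):
    """Greedy 1:1 matching of exposed to unexposed subjects on the PS.

    Exposed subjects without an unexposed subject inside the caliper
    (non-overlap) are returned separately so they can be inspected.
    """
    exposed = [s for s in subjects if s[0] == 1]
    pool = [s for s in subjects if s[0] == 0]
    pairs, excluded = [], []
    for s in exposed:
        best = min(pool, key=lambda u: abs(u[1] - s[1]), default=None)
        if best is not None and abs(best[1] - s[1]) <= caliper:
            pool.remove(best)  # match without replacement
            pairs.append((s, best))
        else:
            excluded.append(s)
    return pairs, excluded


def risk_difference(pairs):
    """Mean within-pair outcome difference (exposed minus unexposed)."""
    return sum(e[2] - u[2] for e, u in pairs) / len(pairs)


def bootstrap_ci(pairs, estimator, n_boot=1000, seed=0):
    """Percentile 95% interval from resampling matched pairs with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        estimator([rng.choice(pairs) for _ in pairs]) for _ in range(n_boot)
    )
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot) - 1]
```

In this sketch one would call `fit_calibration` on the validation study, pass the coefficients to `calibrate` for the main study, match on the calibrated score with `match_pairs`, and then pass the resulting pairs and `risk_difference` to `bootstrap_ci`. The `excluded` list returned by `match_pairs` holds the exposed subjects dropped for non-overlap, to whom the estimate might not generalize.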
