Abstract

In some scenarios, the observational data needed for causal inferences are spread over two data files. In particular, we consider scenarios where one file includes covariates and the treatment measured on a set of individuals, and a second file includes responses measured on another, partially overlapping set of individuals. In the absence of error-free direct identifiers like social security numbers, straightforward merging of separate files is not feasible, so that records must be linked using error-prone variables such as names, birth dates, and demographic characteristics. Typical practice in such situations generally follows a two-stage procedure: first link the two files using a probabilistic linkage technique, then make causal inferences with the linked dataset. This does not propagate uncertainty due to imperfect linkages to the causal inference, nor does it leverage relationships among the study variables to improve the quality of the linkages. We propose a joint model for simultaneous Bayesian inference on probabilistic linkage and causal effects that addresses these deficiencies. Using simulation studies and theoretical arguments, we show that the joint model can improve the accuracy of estimated treatment effects, as well as the record linkages, compared to the two-stage modeling option. We illustrate the joint model using a constructed causal study of the effects of debit card possession on household spending.

Highlights

  • In some scenarios, researchers seek to make causal inferences from variables spread over two datasets

  • We present the joint model for Bayesian causal inference and record linkage for the setting where the outcomes y are in File A, and the covariates x and the treatment status w are in File B

  • Results based on outcome models with propensity scores computed from all records in the 1995 data are presented in Supplement I; they are essentially identical to what we present here

Summary

Introduction

Researchers seek to make causal inferences from variables spread over two datasets. In the typical two-stage approach, the researcher first links records using a probabilistic record linkage model based on indirect identifiers, without taking into account available information on the outcomes, covariates, or treatment status. We follow the Bayesian paradigm for causal inference and posit models for the missing potential outcomes, conditional on the linking status and known covariates. Wortman and Reiter (2018) introduced the concept of allowing the causal model to inform the linkage model. Their (non-Bayesian) approach uses point estimates of average causal effects to determine the thresholds at which record pairs are declared links in a Fellegi and Sunter (1969) algorithm.
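
For readers unfamiliar with threshold-based linkage, the following is a minimal sketch of the classical Fellegi and Sunter decision rule, in which each record pair receives a summed log-likelihood-ratio weight and is declared a link, a non-link, or a possible link by comparison with two thresholds. It is not the joint model proposed here, nor the Wortman and Reiter (2018) procedure. The field names, the m- and u-probabilities, and the threshold values are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical agreement probabilities for three comparison fields:
#   m = P(field agrees | the two records belong to the same individual)
#   u = P(field agrees | the two records belong to different individuals)
M_PROBS = {"last_name": 0.95, "birth_year": 0.90, "zip_code": 0.85}
U_PROBS = {"last_name": 0.01, "birth_year": 0.05, "zip_code": 0.10}


def match_weight(agreement):
    """Sum the log2 likelihood ratios over fields for one record pair.

    `agreement` maps each field name to True (agrees) or False (disagrees).
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROBS[field], U_PROBS[field]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1.0 - m) / (1.0 - u))
    return total


def classify(weight, upper=5.0, lower=0.0):
    """Apply the two-threshold rule (threshold values are illustrative)."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


if __name__ == "__main__":
    pair = {"last_name": True, "birth_year": False, "zip_code": False}
    w = match_weight(pair)
    print(round(w, 2), classify(w))
```

In a two-stage analysis the thresholds are fixed from the linking variables alone. The approaches discussed above instead let information from the causal model inform which pairs are treated as links.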

Background and Notation for Bayesian Causal Inference
Strong Ignorability
Background and Notation for Probabilistic Record Linkage
Joint Model for Bayesian Causal Inference and Record Linkage
Posterior Computation
Simulation Studies
Simulated Data Generation
Results
Causal Study of Debit Cards
Data Description and Background
Discussion and Future
