Nonparametric Generation of Synthetic Data for Small Geographic Areas

Joseph W Sakshaug, Trivellore E Raghunathan

doi:10.1007/978-3-319-11257-2_17

Abstract

Computing and releasing statistics for small geographic areas is a common task for many statistical agencies, but releasing public-use microdata for these areas is much less common due to data confidentiality concerns. Accessing the restricted microdata is usually only possible within a research data center (RDC). This arrangement is inconvenient for many researchers who must travel large distances and, in some cases, pay a sizeable data usage fee to access the nearest RDC. An alternative data dissemination method that has been explored is to release public-use synthetic data. In general, synthetic data consists of imputed values drawn from a predictive model based on the observed data. Data confidentiality is preserved because no actual data values are released. The imputed values are typically drawn from a standard, parametric distribution, but often key variables of interest do not follow strict parametric forms. In this paper, we apply a nonparametric method for generating synthetic data for continuous variables collected from small geographic areas. The method is evaluated using data from the 2005-2007 American Community Survey. The analytic validity of the synthetic data is assessed by comparing parametric (baseline) and nonparametric inferences obtained from the synthetic data with those obtained from the observed data.

Keywords

data confidentiality, hierarchical Bayesian model, multiple imputation, small area inference

Full Text