On Estimating the Swapping Rate for Categorical Data

Daniel Kifer

doi:10.1145/2783258.2783369

Abstract

When analyzing data, it is important to account for all sources of noise. Public use datasets, such as those provided by the Census Bureau, often undergo additional perturbations designed to protect confidentiality. This source of noise is generally ignored in data analysis because crucial parameters and details about its implementation are withheld. In this paper, we consider the problem of inferring such parameters from the data. Specifically, we target data swapping, a perturbation technique that the U.S. Census Bureau commonly uses and that, barring practical breakthroughs in disclosure control, will remain in use for the foreseeable future. The vanilla version of data swapping selects pairs of records and exchanges some of their attribute values. The number of swapped records is kept secret, even though it is needed both for data analysis and for investigations into the confidentiality protection of individual records. We propose algorithms for estimating the number of swapped records in categorical data, even when the true data distribution is unknown.
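To make the vanilla swapping step concrete, the following is a minimal Python sketch of one way it could look. It is an illustration only, not the Census Bureau's procedure (whose details are withheld) and not the estimation algorithm proposed in the paper. The function name `swap_records`, its parameters, the uniform random choice of pairs, and the toy records are all assumptions made for this example.

```python
import random


def swap_records(records, swap_attrs, num_pairs, seed=None):
    """Hypothetical vanilla data swapping on a list of dict records.

    Picks `num_pairs` disjoint pairs of records uniformly at random
    (an assumption; real implementations may target specific records)
    and exchanges the values of `swap_attrs` within each pair.
    Returns a perturbed copy; the input list is left unchanged.
    """
    rng = random.Random(seed)
    swapped = [dict(r) for r in records]  # copy each record
    # Draw 2 * num_pairs distinct indices, then pair them consecutively.
    chosen = rng.sample(range(len(swapped)), 2 * num_pairs)
    for i, j in zip(chosen[0::2], chosen[1::2]):
        for attr in swap_attrs:
            swapped[i][attr], swapped[j][attr] = swapped[j][attr], swapped[i][attr]
    return swapped


# Toy categorical data (invented for illustration).
records = [
    {"age": "18-34", "income": "low", "tract": "A"},
    {"age": "35-64", "income": "high", "tract": "B"},
    {"age": "65+", "income": "mid", "tract": "C"},
    {"age": "18-34", "income": "mid", "tract": "D"},
]

# Swap the "tract" attribute between one randomly chosen pair of records.
perturbed = swap_records(records, swap_attrs=["tract"], num_pairs=1, seed=0)
```

In this sketch, swapping leaves the marginal counts of every attribute unchanged and only alters the joint distribution between swapped and unswapped attributes. The number of swapped records (here `2 * num_pairs`) is the quantity that the data publisher keeps secret.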

Full Text