Abstract

Preserving data confidentiality is crucial when releasing microdata for public use. A variety of approaches have been proposed, many of them based on traditional probability theory and statistics, and they mainly focus on masking the original data. In practice, these masking techniques, even though they cover part of the data, risk leaving sensitive records exposed. In this paper, we approach the problem with a deep learning-based generative model that generates simulated data to mask the original data. Generating simulated data that preserves the same statistical characteristics as the raw data is the key idea, and also the main challenge, of this study. In particular, we explore the statistical similarity between the raw data and the generated data, given that the generated data is not obviously distinguishable from the raw data. We evaluate our results with two statistical metrics: absolute relative residual values and the Hellinger distance. We also conduct extensive experiments to validate our idea on two real-world datasets: the Census Dataset and the Environmental Dataset.
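The abstract names two evaluation metrics but does not define them here. Below is a minimal sketch, assuming the standard definitions of the Hellinger distance (between binned distributions of a raw and a generated attribute) and of an absolute relative residual (between a summary statistic of the raw data and the same statistic of the generated data); the function names, bin count, and the synthetic example data are illustrative, not the paper's actual implementation.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions.

    Both inputs are assumed to be normalized histograms over the same bins.
    Ranges from 0 (identical) to 1 (disjoint support).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def absolute_relative_residual(raw_stat, generated_stat):
    """Absolute relative residual between a statistic of the raw data and
    the corresponding statistic of the generated data (e.g. a column mean)."""
    return abs(raw_stat - generated_stat) / abs(raw_stat)

# Illustrative comparison of a raw attribute against a generated counterpart.
rng = np.random.default_rng(0)
raw = rng.normal(50, 10, size=10_000)        # stand-in for an original column
gen = rng.normal(50.5, 10.2, size=10_000)    # stand-in for a generated column

# Bin both samples on a common grid, then normalize to probabilities.
bins = np.histogram_bin_edges(np.concatenate([raw, gen]), bins=30)
p, _ = np.histogram(raw, bins=bins)
q, _ = np.histogram(gen, bins=bins)
p = p / p.sum()
q = q / q.sum()

print("Hellinger distance:", hellinger_distance(p, q))
print("Abs. relative residual of the mean:",
      absolute_relative_residual(raw.mean(), gen.mean()))
```

Under these assumptions, values close to zero for both metrics indicate that the generated data closely tracks the statistical characteristics of the raw data.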
