Abstract

Preserving data confidentiality is crucial when releasing microdata for public use. A variety of approaches have been proposed, many of which are based on traditional probability theory and statistics and focus mainly on masking the original data. In practice, these masking techniques cover only part of the data and therefore risk exposing sensitive information upon release. In this paper, we address the problem with a deep learning-based generative model that produces simulated data to stand in for the original data. Generating simulated data that preserves the statistical characteristics of the raw data is the key idea and the main challenge of this study. In particular, we examine the statistical similarity between the raw data and the generated data, requiring that the two are not obviously distinguishable. We evaluate our results with two statistical metrics: Absolute Relative Residual Values and Hellinger Distance. We also conduct extensive experiments on two real-world datasets, the Census Dataset and the Environmental Dataset, to validate our approach.
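
As a rough illustration of the two evaluation metrics named above, the following sketch computes the Hellinger distance between binned marginal distributions of a raw and a generated attribute, and the absolute relative residual of a summary statistic. The binning scheme, the synthetic stand-in data, and the function names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions p and q,
    given as normalized histograms over the same bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2.0)

def absolute_relative_residual(raw_stat, gen_stat):
    """Absolute relative residual between a statistic of the raw data
    and the same statistic computed on the generated data."""
    return abs(raw_stat - gen_stat) / abs(raw_stat)

# Stand-in data for one numeric attribute of raw vs. generated microdata
# (hypothetical values, used only to make the example runnable).
rng = np.random.default_rng(0)
raw = rng.normal(loc=50.0, scale=10.0, size=10_000)
generated = rng.normal(loc=50.5, scale=10.2, size=10_000)

# Binned marginals over a shared grid, normalized to probability mass.
bins = np.linspace(min(raw.min(), generated.min()),
                   max(raw.max(), generated.max()), 51)
p, _ = np.histogram(raw, bins=bins)
q, _ = np.histogram(generated, bins=bins)
p = p / p.sum()
q = q / q.sum()

print("Hellinger distance:", hellinger_distance(p, q))
print("ARR of the mean:   ", absolute_relative_residual(raw.mean(), generated.mean()))
```

Lower values of both metrics indicate that the generated data more closely matches the statistical characteristics of the raw data.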
