Abstract

The demand for data from surveys, censuses or registers containing sensitive information on people or enterprises has increased significantly in recent years. However, before data can be provided to the public or to researchers, confidentiality has to be respected for any data set possibly containing sensitive information about individual units. Confidentiality can be achieved by applying statistical disclosure control (SDC) methods to the data in order to reduce the disclosure risk. The R package sdcMicro serves as an easy-to-handle, object-oriented S4 class implementation of SDC methods to evaluate and anonymize confidential micro-data sets. It includes all popular disclosure risk and perturbation methods. The package automatically recalculates frequency counts, individual and global risk measures, information loss and data utility statistics after each anonymization step. All methods are highly optimized in terms of computational cost so that they can work with large data sets. Reporting facilities that summarize the anonymization process can also be easily used by practitioners. We describe the package and demonstrate its functionality with a complex household survey test data set that has been distributed by the International Household Survey Network.

Highlights

  • Statistical disclosure control (SDC) is an emerging field of research

  • Methods used in statistical disclosure control borrow techniques from other fields

  • For each method discussed we show its usage via the command line interface of sdcMicro

Summary

Introduction

Statistical disclosure control (SDC) is an emerging field of research. More and more data on persons and establishments are collected by statistical organizations, and almost all of these data hold confidential information.

We cannot compare the computation speed of μ-Argus with that of sdcMicro, because the μ-Argus methods cannot be applied through a command line interface, but we would like to point out that μ-Argus is not suitable for large data sets: it becomes slow and runs out of memory even with medium-sized data sets.

The Methods section starts with disclosure risk methods, followed by anonymization methods and methods to measure data utility.
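To make the notion of frequency counts concrete, the following is a minimal sketch in base R of how sample frequency counts over categorical key variables can be computed and used to flag records that violate k-anonymity. It does not use the sdcMicro API and is not the package's implementation; the toy data frame, its column names and the threshold `k` are assumptions made purely for illustration.

```r
# Hypothetical toy micro-data; values and column names are illustrative only.
dat <- data.frame(
  region = c("urban", "urban", "rural", "rural", "urban", "rural"),
  sex    = c("f", "f", "m", "m", "m", "m"),
  age    = c("20-39", "20-39", "40-59", "40-59", "60+", "40-59"),
  income = c(2100, 1800, 2500, 2300, 3900, 2700),
  stringsAsFactors = FALSE
)

# Categorical key variables (quasi-identifiers) assumed for this example.
key_vars <- c("region", "sex", "age")

# Build one pattern string per record from its key-variable values.
pattern <- do.call(paste, c(dat[key_vars], sep = "|"))

# Sample frequency count: how many records share each record's pattern.
dat$fk <- as.integer(ave(seq_along(pattern), pattern, FUN = length))

# Records whose pattern occurs fewer than k times violate k-anonymity.
k <- 2
violations <- which(dat$fk < k)

print(dat[, c(key_vars, "fk")])
print(violations)
```

In this toy data only the fifth record has a unique combination of key-variable values, so it is the only one flagged for `k = 2`. An anonymization step such as recoding or suppressing one of its key values would change the counts, which is why recalculating them after each step matters.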

Classification of variables
Challenges
Work flow
General information about sdcMicro and performance
S4 class structure
Aim
Utility functions
Methods
Measuring the disclosure risk
Anonymization methods
Measuring data utility
Reporting facilities
Findings
Conclusion