Abstract

“Rounding” can be understood as a way to coarsen continuous data: low-level, infrequent values are replaced by higher-level, more frequent representative values. This concept is explored as a data privacy method in the statistical disclosure control literature, through perturbative techniques such as rounding and microaggregation and non-perturbative methods such as generalisation. Even though “rounding” is well known as a numerical data protection method, to the best of our knowledge it has not been studied in depth or evaluated empirically. This work is motivated by three objectives: (1) to study alternative methods of obtaining the rounding values that represent a given continuous variable, (2) to empirically evaluate rounding as a data protection technique in terms of information loss (IL) and disclosure risk (DR), and (3) to analyse the impact of data rounding on machine learning based models. To obtain the rounding values, we consider discretization methods from the unsupervised machine learning literature along with microaggregation and re-sampling based approaches. The results indicate that microaggregation based techniques are preferred over unsupervised discretization methods due to their fair trade-off between IL and DR.
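As a rough illustration of one of these options, the sketch below rounds a numeric attribute with a simplified univariate microaggregation (fixed-size groups of sorted values replaced by their group mean) and measures information loss as the mean squared error between original and rounded values. The function name, the group size k, and the MSE-based IL proxy are illustrative assumptions, not the paper's exact MDAV procedure or IL measure.

```python
import numpy as np

def microaggregate_rounding(x, k=3):
    """Toy univariate microaggregation: sort the values, form groups of at
    least k consecutive values, and replace each value by its group mean.
    Illustrative only; the MDAV variant used in the paper is more involved."""
    order = np.argsort(x)
    rounded = np.empty_like(x, dtype=float)
    n = len(x)
    start = 0
    while start < n:
        # The last group absorbs the remainder so every group has >= k members.
        end = n if n - start < 2 * k else start + k
        idx = order[start:end]
        rounded[idx] = x[idx].mean()
        start = end
    return rounded

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=20)
x_rounded = microaggregate_rounding(x, k=4)

# A simple information-loss proxy: mean squared error between original
# and rounded values (one of several IL measures used in the SDC literature).
il = np.mean((x - x_rounded) ** 2)
print(f"IL (MSE): {il:.3f}")
```

Larger groups coarsen the data more, which generally lowers disclosure risk at the cost of higher information loss.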

Highlights

  • Rounding is based on the operating principle of data discretization or quantization, which maps a given continuous variable into a discrete set of values

  • We compare the information loss (IL) and disclosure risk (DR) values obtained for different synthetic datasets

  • When the results are compared based on the mean squared error (MSE) values, Maximum Distance to Average Vector (MDAV) reports the highest number of instances in which the MSE is lower than or equal to that of the original model



Introduction

Rounding is based on the operating principle of data discretization or quantization, which maps a given continuous variable into a discrete set of values. This is achieved by replacing the values of a given set X = {x1, …, xn} with values drawn from a smaller set of rounding points. Obtaining the rounding points can be explained with respect to scalar quantization (SQ): a given attribute/vector is partitioned into homogeneous groups, and a representative value is chosen for each partition as a rounding point. In information theory terminology, a rounding point can be identified as a code word and the rounding set as the code book. The objective is to generate a code book in a way that minimizes the distortion introduced by the mapping.
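The following sketch makes the scalar-quantization view concrete: a code book is learned with 1-D k-means (standing in for one of the unsupervised discretization methods), each original value is mapped to its nearest code word, and the distortion introduced by the mapping is measured as MSE. The choice of k-means, the number of code words, and the variable names are assumptions for illustration, not the exact procedure evaluated in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
x = rng.exponential(scale=5.0, size=200)   # a skewed continuous attribute

# Learn the code book: partition the values into homogeneous groups and
# take each group's centre as a code word (rounding point).
n_codewords = 8
km = KMeans(n_clusters=n_codewords, n_init=10, random_state=1).fit(x.reshape(-1, 1))
codebook = np.sort(km.cluster_centers_.ravel())   # the rounding set / code book

# "Round" every original value to its nearest code word.
nearest = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
x_quantized = codebook[nearest]

# Distortion introduced by the mapping (mean squared error).
distortion = np.mean((x - x_quantized) ** 2)
print(f"code book: {np.round(codebook, 2)}")
print(f"distortion (MSE): {distortion:.3f}")
```

Increasing the number of code words reduces distortion but makes the released values closer to the originals, which is the IL/DR trade-off the paper evaluates.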

