Fast data-oriented microaggregation algorithm for large numerical datasets

Reza Mortazavi, Saeed Jalili

doi:10.1016/j.knosys.2014.05.011

Abstract

Microaggregation is a successful mechanism for resolving the tension between respondent privacy and data quality in the context of Statistical Disclosure Control. For numerical datasets, microaggregation is defined as a clustering problem with the constraint that each group contains at least k records, such that the within-group sum of squared errors (SSE) is minimized. Unfortunately, the data publisher has to execute an algorithm iteratively for different values of k to find a good trade-off between privacy and utility. Repeated execution of an algorithm on large numerical datasets wastes resources, since most of the computations are repetitive. In this paper, we propose a Fast Data-oriented Microaggregation algorithm (FDM) that efficiently anonymizes large multivariate numerical datasets for multiple successive values of k. Experimental results on real-world datasets demonstrate the superiority of the method in terms of both data quality and time complexity. Moreover, the method usually achieves a better trade-off between disclosure risk and information loss of the protected dataset than previous techniques.
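To make the definition concrete, the following minimal Python sketch checks the group-size constraint for a given partition, replaces each record with its group centroid, and computes the SSE. It is not the FDM algorithm: the partition is supplied by hand, and the function name `microaggregate`, its parameters, and the sample records are assumptions introduced only for illustration.

```python
from collections import defaultdict


def microaggregate(records, labels, k):
    """Illustrative helper (hypothetical, not FDM).

    records: list of equal-length numeric tuples
    labels:  group label for each record (a partition chosen elsewhere)
    k:       minimum number of records allowed in any group
    Returns (protected_records, sse).
    """
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    # Constraint from the definition: at least k records in each group.
    for label, members in groups.items():
        if len(members) < k:
            raise ValueError(f"group {label!r} has fewer than {k} records")

    dim = len(records[0])
    protected = [None] * len(records)
    sse = 0.0
    for members in groups.values():
        # Group centroid: attribute-wise mean of the member records.
        centroid = tuple(
            sum(records[i][d] for i in members) / len(members)
            for d in range(dim)
        )
        for i in members:
            # Accumulate the squared Euclidean distance to the centroid.
            sse += sum((records[i][d] - centroid[d]) ** 2 for d in range(dim))
            protected[i] = centroid
    return protected, sse


# Made-up sample data: four 2-attribute records in two groups of size 2.
records = [(1.0, 2.0), (3.0, 4.0), (10.0, 10.0), (12.0, 14.0)]
labels = [0, 0, 1, 1]
protected, sse = microaggregate(records, labels, k=2)
```

A lower SSE means the published centroids stay closer to the original records. Raising k forces larger groups, which generally strengthens privacy at the cost of a higher SSE; this is the trade-off the publisher explores by trying several values of k.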
