Abstract
Interest in data anonymization is growing rapidly, motivated by governments' willingness to open their data. The main challenge of data anonymization is to find a balance between data utility and disclosure risk. One of the best-known frameworks for data anonymization is k-anonymity, which considers a dataset anonymous if and only if, for each element of the dataset, there exist at least k − 1 other elements identical to it. In this paper, we propose two techniques to achieve k-anonymity through microaggregation: k-CMVM and Constrained-CMVM. Both use topological collaborative clustering to obtain k-anonymous data. The first determines the k levels automatically, and the second defines them by exploration. We also improve the results of these two approaches by using pLVQ2 as a weighted vector quantization method. The four proposed methods were shown to be efficient according to two data utility measures: the separability utility and the structural utility. The experimental results show very promising performance.
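To make the k-anonymity definition above concrete, here is a minimal Python sketch. It is not k-CMVM or Constrained-CMVM: the function names are hypothetical, and the grouping rule (sort the records, then cut them into consecutive groups of at least k) is a deliberately naive stand-in for the topological collaborative clustering used in the paper. It only illustrates the general idea of microaggregation, where each record is replaced by the centroid of its group.

```python
from collections import Counter


def is_k_anonymous(records, k):
    """Return True if every record occurs at least k times,
    i.e. has at least k - 1 identical companions."""
    counts = Counter(tuple(r) for r in records)
    return all(c >= k for c in counts.values())


def naive_microaggregate(records, k):
    """Illustrative microaggregation (assumption: NOT the paper's method).

    Sort the numeric records, cut them into consecutive groups of at
    least k, and replace each record by the mean of its group.
    """
    if len(records) < k:
        raise ValueError("need at least k records")
    ordered = sorted(records)
    n_groups = len(ordered) // k
    out = []
    for g in range(n_groups):
        start = g * k
        # The last group absorbs any leftover records.
        end = (g + 1) * k if g < n_groups - 1 else len(ordered)
        group = ordered[start:end]
        centroid = tuple(sum(col) / len(group) for col in zip(*group))
        out.extend([centroid] * len(group))
    return out


data = [(1.0, 2.0), (1.2, 2.2), (5.0, 6.0), (5.2, 6.4), (9.0, 1.0)]
anonymized = naive_microaggregate(data, k=2)
print(is_k_anonymous(data, 2), is_k_anonymous(anonymized, 2))
```

Larger groups lower the disclosure risk but move each record further from its original value, which is the utility trade-off the abstract refers to.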
Highlights
Nowadays, data is used in every aspect of human life
We propose two techniques to achieve k-anonymity through microaggregation: k-CMVM and Constrained-CMVM
The first uses the prototypes of the Best Matching Unit (BMU) (k-CMVM), and the second uses the linear mixture of models (Constrained-CMVM)
Summary
Data is used in every aspect of human life. It is collected by sensors, social networks, mobile applications and connected objects in order to process it, explore it, transform it and learn from it. Anonymization approaches were at first mainly based on the randomization method, which consists of adding noise to the data [1]. The risk of privacy breaches under randomization led to the emergence of the k-anonymization method [38]. This group-based anonymization method outputs a dataset in which each record is identical to at least k − 1 others. The anonymization is achieved by first removing the key identifiers, such as the name and the address, and then generalizing and/or suppressing the pseudo-identifiers, for example the date of birth, the ZIP code, the gender and the age. At the end of the topological learning, "similar" data are collected in clusters corresponding to sets of similar patterns.
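The two-step procedure described above (remove key identifiers, then generalize or suppress pseudo-identifiers) can be sketched as follows. This is a toy illustration under stated assumptions: the field names, the generalization rules (ZIP prefix, 10-year age bands) and the sample records are all hypothetical and do not come from the paper.

```python
from collections import Counter

KEY_IDENTIFIERS = ("name", "address")      # removed outright
PSEUDO_IDENTIFIERS = ("zip", "age", "gender")


def generalize(record):
    """Coarsen the pseudo-identifiers of one record (hypothetical rules)."""
    out = dict(record)
    decade = (record["age"] // 10) * 10
    out["zip"] = record["zip"][:3] + "**"    # keep only the ZIP prefix
    out["age"] = f"{decade}-{decade + 9}"    # replace age by a 10-year band
    return out


def anonymize(records, k):
    """Drop key identifiers, generalize, then suppress rare records."""
    stripped = [
        {f: v for f, v in r.items() if f not in KEY_IDENTIFIERS}
        for r in records
    ]
    generalized = [generalize(r) for r in stripped]

    def signature(r):
        return tuple(r[q] for q in PSEUDO_IDENTIFIERS)

    counts = Counter(signature(r) for r in generalized)
    # Suppression: discard records whose signature is shared by fewer than k.
    return [r for r in generalized if counts[signature(r)] >= k]


people = [
    {"name": "Alice", "address": "1 A St", "zip": "75001", "age": 34, "gender": "F", "diagnosis": "flu"},
    {"name": "Beth", "address": "2 B St", "zip": "75002", "age": 38, "gender": "F", "diagnosis": "asthma"},
    {"name": "Carl", "address": "3 C St", "zip": "69001", "age": 52, "gender": "M", "diagnosis": "flu"},
    {"name": "Dan", "address": "4 D St", "zip": "69003", "age": 57, "gender": "M", "diagnosis": "diabetes"},
    {"name": "Eve", "address": "5 E St", "zip": "13001", "age": 41, "gender": "F", "diagnosis": "flu"},
]
released = anonymize(people, k=2)
```

In this example the fifth record has a unique combination of pseudo-identifiers after generalization, so it is suppressed. The remaining four records form two groups of two that are indistinguishable on ZIP, age and gender.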