Abstract

Interest in data anonymization is growing rapidly, motivated by the willingness of governments to open their data. The main challenge of data anonymization is to find a balance between data utility and disclosure risk. One of the best-known frameworks for data anonymization is k-anonymity, which considers a dataset anonymous if and only if, for each element of the dataset, there exist at least k − 1 other elements identical to it. In this paper, we propose two techniques to achieve k-anonymity through microaggregation: k-CMVM and Constrained-CMVM. Both use topological collaborative clustering to obtain k-anonymous data. The first determines the k levels automatically and the second defines them by exploration. We also improve the results of these two approaches by using pLVQ2 as a weighted vector quantization method. The four proposed methods were shown to be efficient under two data utility measures, separability utility and structural utility. The experimental results show promising performance.

Highlights

  • Nowadays, data is used in every aspect of human life

  • We propose two techniques to achieve k-anonymity through microaggregation: k-CMVM and Constrained-CMVM

  • The first uses the prototypes of the Best Matching Unit (BMU) (k-CMVM), and the second uses a linear mixture of models (Constrained-CMVM)


Summary

Introduction

Data is used in every aspect of human life. It is collected by sensors, social networks, mobile applications and connected objects so that it can be processed, explored, transformed and learned from. Earlier anonymization approaches were mainly based on randomization, which consists of adding noise to the data [1]. Because of the risk of privacy breaches, randomization was overtaken by the k-anonymization method [38]. This group-based anonymization method outputs a dataset in which every record is identical to at least k − 1 others. Anonymization is achieved by first removing the key identifiers, such as the name and the address, and then generalizing and/or suppressing the pseudo-identifiers, for example the date of birth, the ZIP code, the gender and the age. At the end of the topological learning, "similar" data is grouped into clusters that correspond to sets of similar patterns.
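
To make the two-step description above concrete, here is a minimal Python sketch of generic k-anonymization by generalization. It is not the k-CMVM or Constrained-CMVM method proposed in the paper. The toy records, the field names, and the helper functions `generalize` and `is_k_anonymous` are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
RAW = [
    {"name": "Alice", "address": "1 Main St", "date_of_birth": "1980-03-12",
     "zip": "75001", "gender": "F", "diagnosis": "flu"},
    {"name": "Anna", "address": "2 Oak Ave", "date_of_birth": "1980-11-02",
     "zip": "75009", "gender": "F", "diagnosis": "asthma"},
    {"name": "Bob", "address": "3 Elm Rd", "date_of_birth": "1975-06-30",
     "zip": "69003", "gender": "M", "diagnosis": "flu"},
    {"name": "Bill", "address": "4 Pine Ln", "date_of_birth": "1975-01-15",
     "zip": "69007", "gender": "M", "diagnosis": "diabetes"},
]


def generalize(record):
    """Remove key identifiers (name, address) and coarsen pseudo-identifiers."""
    return {
        "birth_year": record["date_of_birth"][:4],  # keep only the year
        "zip": record["zip"][:3] + "**",            # keep a 3-digit ZIP prefix
        "gender": record["gender"],
        "diagnosis": record["diagnosis"],           # non-identifying attribute, kept as-is
    }


def is_k_anonymous(records, pseudo_identifiers, k):
    """True if every pseudo-identifier combination is shared by at least k records."""
    counts = Counter(tuple(r[a] for a in pseudo_identifiers) for r in records)
    return all(c >= k for c in counts.values())


generalized = [generalize(r) for r in RAW]
print(is_k_anonymous(RAW, ["date_of_birth", "zip", "gender"], 2))
print(is_k_anonymous(generalized, ["birth_year", "zip", "gender"], 2))
```

In this sketch the check is applied to the pseudo-identifiers only. Coarser generalization makes a larger k reachable but discards more detail, which is the trade-off between utility and disclosure risk described in the abstract.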

Fundamental background of the proposed approaches
Multi-view Collaborative Learning
Proposed Anonymization Approaches
Pre-Anonymization Step
Constrained CMVM
Fine tuning
Incorporating Discriminative Power
Datasets
Utility Measures and Statistical Analysis
Davies Bouldin Index
Silhouette Index
Calinski Harabasz Index
Structural Utility using the Earth Mover’s Distance
Preserving combined utility
Conclusion