The Effect of Clustering on Data Privacy

Pelin Canbay, Hayri Sever

doi:10.1109/icmla.2015.198

Abstract

Data collected by various organizations provide opportunities for developing future solutions. It is essential that accurate data be shareable with research communities and scientists in order to improve quality of life. However, accurate records of personal data include sensitive information about individuals. Hence, sharing these records without applying any anonymization criteria paves the way for the disclosure of personal information. In an effort to protect personal privacy, Privacy-Preserving Data Mining (PPDM) and Privacy-Preserving Data Publishing (PPDP) approaches have been studied extensively. Numerous works have been dedicated to diversifying techniques for de-identification or anonymization of identifiable datasets, but there is an important trade-off between data loss and data privacy. When original data are anonymized, they are exposed to information loss. In order to minimize information loss, anonymization algorithms neglect to preserve diversity. In this study, we propose an approach that uses a clustering algorithm as a pre-processing step for privacy-preserving methods to improve the diversity of the anonymized data. In addition, the effect of clustering on anonymization was evaluated using the original and clustered forms of a real-world dataset. The results were evaluated in terms of usability in scientific work, and it was observed that a clustering algorithm and an effective anonymization algorithm must be used in privacy preservation approaches in order to preserve the diversity of the original datasets.
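
The idea of clustering as a pre-processing step can be illustrated with a minimal Python sketch. The abstract does not name the clustering or anonymization algorithms used, so everything here is an assumption made for illustration: a tiny one-dimensional k-means over a hypothetical `age` quasi-identifier, followed by a toy k-anonymity-style generalization that replaces each age with a range shared by at least `k` records. The function names, field names, and sample records are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means (illustrative only); returns a cluster index per value."""
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assign each value to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda c: abs(v - centers[c]))
            for v in values
        ]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def anonymize(records, k):
    """Toy generalization: replace each age with a range covering >= k records.

    Assumes len(records) >= k; a smaller input cannot satisfy the group size.
    """
    ordered = sorted(records, key=lambda r: r["age"])
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # Merge an undersized trailing group into the previous one.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    out = []
    for group in groups:
        lo, hi = group[0]["age"], group[-1]["age"]
        for r in group:
            out.append({"age": f"{lo}-{hi}", "diagnosis": r["diagnosis"]})
    return out


def cluster_then_anonymize(records, k, centers):
    """Cluster first, then anonymize each cluster independently."""
    labels = kmeans_1d([r["age"] for r in records], list(centers))
    clusters = defaultdict(list)
    for r, lab in zip(records, labels):
        clusters[lab].append(r)
    result = []
    for lab in sorted(clusters):
        result.extend(anonymize(clusters[lab], k))
    return result


# Hypothetical sample records (not from the paper's dataset).
records = [
    {"age": 21, "diagnosis": "flu"},
    {"age": 23, "diagnosis": "asthma"},
    {"age": 25, "diagnosis": "flu"},
    {"age": 27, "diagnosis": "diabetes"},
    {"age": 61, "diagnosis": "asthma"},
    {"age": 63, "diagnosis": "flu"},
    {"age": 65, "diagnosis": "diabetes"},
    {"age": 67, "diagnosis": "asthma"},
]
result = cluster_then_anonymize(records, k=2, centers=[20, 60])
```

In this sketch, generalization never spans two clusters, so each published age range stays within a group of similar records. Whether that matches the way the study measures or preserves diversity is not stated in the abstract.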

Full Text