Implementation of GKMC Algorithm for Data Anonymization on Big Data Platform Spark

Mariskha Tri Adithia, Stephen Jordan, Veronica Sri Moertini

doi:10.1109/icoict55009.2022.9914906

Abstract

Mining a data set can reveal private information. Privacy-preserving data mining addresses this by protecting data privacy while preserving data utility, so that meaningful insights can still be obtained. Data utility can be measured by computing the information loss: the higher the information loss, the lower the utility of the data set. One privacy-preserving data mining technique is k-anonymity. The k-anonymity technique has not yet been implemented on a big data platform. Because data sizes are growing very quickly, techniques that protect the privacy of big data are needed. However, the implementation must be modified so that the technique still works properly and fast enough on such a platform. In this research, two k-anonymity techniques are used, namely the Greedy k-Member Clustering (GKMC) algorithm and generalization. Both techniques are implemented on the big data platform Spark; some of their steps are parallelized and can therefore run very fast. Experiments are conducted to measure the execution time of each technique and the total information loss. The experiments show that, as the value of k grows, the execution time and the total information loss decrease.
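To make the terms in the abstract concrete, the following is a minimal single-machine sketch of greedy k-member clustering followed by generalization. It is not the Spark implementation described in this paper. It assumes numeric quasi-identifiers only, uses a range-based information loss measure (cluster size times the sum of normalized attribute ranges), and starts from the first record instead of a random one so that runs are repeatable. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Record = Tuple[float, ...]


def attribute_ranges(records: Sequence[Record]) -> List[float]:
    """Global range of each attribute; a zero range is replaced by 1.0 to avoid division by zero."""
    return [(max(col) - min(col)) or 1.0 for col in zip(*records)]


def information_loss(cluster: Sequence[Record], ranges: Sequence[float]) -> float:
    """Assumed measure: |cluster| * sum of (cluster range / global range) per attribute."""
    if not cluster:
        return 0.0
    spread = sum((max(col) - min(col)) / r for col, r in zip(zip(*cluster), ranges))
    return len(cluster) * spread


def distance(a: Record, b: Record, ranges: Sequence[float]) -> float:
    """Normalized Manhattan distance between two records."""
    return sum(abs(x - y) / r for x, y, r in zip(a, b, ranges))


def greedy_k_member(records: Sequence[Record], k: int) -> List[List[Record]]:
    """Group records into clusters of at least k members, greedily keeping information loss low."""
    if k < 1 or len(records) < k:
        raise ValueError("need k >= 1 and at least k records")
    ranges = attribute_ranges(records)
    remaining = list(records)
    clusters: List[List[Record]] = []
    seed = remaining[0]  # assumption: deterministic start instead of a random record
    while len(remaining) >= k:
        # The next cluster starts from the record furthest from the previous seed.
        seed = max(remaining, key=lambda r: distance(r, seed, ranges))
        remaining.remove(seed)
        cluster = [seed]
        while len(cluster) < k:
            # Add the record that yields the smallest information loss for this cluster.
            best = min(remaining, key=lambda r: information_loss(cluster + [r], ranges))
            remaining.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    for r in remaining:
        # Fewer than k records are left: put each into the cluster whose loss grows least.
        target = min(
            clusters,
            key=lambda c: information_loss(c + [r], ranges) - information_loss(c, ranges),
        )
        target.append(r)
    return clusters


def generalize(cluster: Sequence[Record]) -> Tuple[Tuple[float, float], ...]:
    """Replace each attribute by the (min, max) interval that covers the whole cluster."""
    return tuple((min(col), max(col)) for col in zip(*cluster))


def total_information_loss(clusters: Sequence[Sequence[Record]], ranges: Sequence[float]) -> float:
    """Sum of the information loss over all clusters."""
    return sum(information_loss(c, ranges) for c in clusters)
```

Each resulting cluster holds at least k records, so publishing only the generalized intervals from `generalize` makes every record indistinguishable from at least k - 1 others on these attributes. The inner `min` search scans all remaining records for every insertion, which is the kind of step a parallel platform could distribute; how the paper parallelizes its steps is not shown here.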

Full Text