Global Combination and Clustering Based Differential Privacy Mixed Data Publishing

Lanxiang Chen, Yi Mu, Lingfang Zeng, Leilei Chen

doi:10.1109/tkde.2023.3237822

Abstract

With the rapid advancement of information technology, large amounts of high-value data have been generated. To exploit the potential value of big data while protecting individuals' sensitive information, this paper proposes a global combination and clustering based differential privacy (DP) method for publishing mixed data. The main idea is to improve both the truthfulness and the utility of the published data by shifting the sensitivity of the query function from a single record to a group of records using the $k$-median clustering algorithm. Specifically, to improve the accuracy and utility of categorical attributes, a global combination method is proposed that takes the correlation among categorical attributes into account. The combination method treats all categorical attributes as a single unit and then applies the exponential mechanism to that unit. It is then combined with differentially private $k$-median clustering to publish the mixed data. Theoretical analysis shows that the proposed method satisfies $\varepsilon$-differential privacy. Experimental results on real datasets show that the proposed method has much lower information loss and time overhead than the state-of-the-art approach under the same parameters.
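To make the idea of treating all categorical attributes as one unit more concrete, the following is a minimal Python sketch of the standard exponential mechanism applied to combined categorical tuples within one cluster. It is not the authors' algorithm: the function names, the count-based utility, the sensitivity of 1, and the requirement that `domain` be a public, data-independent list of combinations are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def exponential_mechanism(candidates, utility, epsilon, sensitivity=1.0, rng=None):
    """Sample one candidate with probability proportional to
    exp(epsilon * utility / (2 * sensitivity))."""
    rng = rng or random.Random()
    scores = [utility(c) for c in candidates]
    # Subtracting the maximum score keeps exp() from overflowing and
    # does not change the sampling distribution.
    max_score = max(scores)
    weights = [
        math.exp(epsilon * (s - max_score) / (2.0 * sensitivity)) for s in scores
    ]
    return rng.choices(candidates, weights=weights, k=1)[0]


def release_combined_categorical(cluster, domain, epsilon, rng=None):
    """Pick one combined categorical tuple to represent a cluster.

    cluster: list of tuples, each holding ALL categorical values of one record
             (hypothetical representation of the "global combination").
    domain:  list of every allowed combination; assumed public and
             independent of the data.
    """
    counts = Counter(cluster)
    # Assumed utility: how many records in the cluster carry this exact
    # combination. Adding or removing one record changes a count by at
    # most 1, hence sensitivity 1.
    return exponential_mechanism(
        domain, lambda c: counts[c], epsilon, sensitivity=1.0, rng=rng
    )
```

A possible call, with made-up attribute values:

```python
import itertools

domain = list(itertools.product(["Private", "State-gov"], ["Bachelors", "Masters"]))
cluster = [("Private", "Bachelors"), ("Private", "Bachelors"), ("State-gov", "Masters")]
print(release_combined_categorical(cluster, domain, epsilon=1.0))
```

Sampling whole tuples rather than one attribute at a time is what lets correlated values stay together in the output; the cost is that the candidate domain grows with the product of the attribute domain sizes.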

Full Text