Differential privacy (DP) is a promising tool for preserving privacy during data publication, as it provides strong theoretical privacy guarantees in face of adversaries with arbitrary background knowledge. Histogram, as the result of a set of count queries, serves as a core statistical tool to report data distributions and is in fact viewed as the fundamental method for many other statistical analysis such as range queries. It is an important form for data publishing. In this paper, we consider the scenario of publishing sensitive histogram data with differential privacy scheme. Existing work in this field has justified that, comparing to directly applying DP techniques (i.e., injecting noise) over the counts in histogram bins, grouping bins before noise injection is more effective (i.e., with higher utility) as it introduces much less error over the sanitized histogram given the same privacy budget. However, state-of-the-art works have not unveiled how the overall utility of a sanitized histogram can be affected by the balance between the privacy budget distributed between grouping and noise injection phases. In this work, we conduct a theoretical study towards how the probability of getting better groups can be improved such that the overall error introduced in sanitized histogram can be further reduced, which directly leads to a higher utility for the sanitized histograms. In particular, we show that the probability of achieving better grouping can be affected by two factors, namely privacy budget assigned in grouping and the normalized utility function used for selecting groups. Motivated by that, we propose a new DP histogram publishing scheme, namely Iterative Histogram Partition, in which we carefully assign privacy budget between grouping and injection phases based on our theoretical study. We also theoretically prove that \(\epsilon \)-differential privacy can be achieved according to our new scheme. Moreover, we also show that, under the same privacy budget, our scheme exhibits less errors in the sanitized histograms comparing with state-of-the-art methods. We also extends the model to multi-dimensional histogram publication cases. Finally, empirical study over four real-world datasets also justifies that our scheme achieves the least error among series of state-of-the-art baseline methods.
Read full abstract