Distributed Privacy-Aware Fast Selection Algorithm for Large-Scale Data

Hao Liu, Jiming Chen

doi:10.1109/tpds.2017.2761344

Abstract

Finding the $k$-th smallest/largest element of a large array, i.e., $k$-selection, is a fundamental supporting algorithm in data analysis. Because big data is born in geo-distributed environments, it especially requires communication-efficient distributed $k$-selection, in addition to the usual computation and memory efficiency. Moreover, organizations holding sensitive data make data privacy a strict precondition for their participation in such distributed statistical analysis for common profit. To this end, we propose a Distributed Privacy-Aware Median (DPAM) selection algorithm for median selection in distributed large-scale data while preserving the privacy of local statistics, and extend it to arbitrary $k$-selection. DPAM uses the mean to approximate the median, via contraction of the standard deviation. It is theoretically the fastest, with a worst-case computation complexity of $\mathcal{O}(N)$, and is also highly efficient in communication overhead (logarithmic in the data range). To preserve $\epsilon$-differential privacy of local statistics, DPAM randomly adds dummy elements to the local data; the number of dummy elements follows a rounded Laplacian distribution. The noise does not degrade the estimation precision or the convergence rate. The performance of DPAM is compared with centralized/distributed quick select and optimization approaches, in terms of complexity and privacy-preserving ability. Extensive simulation and experiment results show that DPAM is more efficient.
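To make the distributed $k$-selection setting concrete, the following is a minimal sketch of a generic count-based bisection baseline. It is not DPAM: it does not use the mean/standard-deviation contraction and adds no privacy noise. It assumes integer data in a known range `[lo, hi]`; the function names `count_leq` and `distributed_k_select` are hypothetical. Each round, every node reports only one count, so the number of communication rounds grows with the logarithm of the data range.

```python
from typing import List, Sequence


def count_leq(local_data: Sequence[int], threshold: int) -> int:
    """Local step: a node reports how many of its elements are <= threshold."""
    return sum(1 for x in local_data if x <= threshold)


def distributed_k_select(nodes: List[Sequence[int]], k: int, lo: int, hi: int) -> int:
    """Return the k-th smallest element (1-indexed) over all nodes.

    Assumption: every element is an integer in [lo, hi]. The coordinator
    bisects the value range and only ever sees per-node counts.
    """
    total = sum(len(node) for node in nodes)
    if not 1 <= k <= total:
        raise ValueError("k must satisfy 1 <= k <= total number of elements")
    while lo < hi:
        mid = (lo + hi) // 2
        # One communication round: each node sends a single integer.
        global_count = sum(count_leq(node, mid) for node in nodes)
        if global_count >= k:
            hi = mid          # the k-th smallest is <= mid
        else:
            lo = mid + 1      # the k-th smallest is > mid
    return lo


if __name__ == "__main__":
    nodes = [[5, 1, 9], [3, 7], [8, 2, 6, 4]]
    print(distributed_k_select(nodes, k=5, lo=0, hi=10))
```

Note that this baseline reveals exact local counts to the coordinator, which is the kind of local statistic that the dummy-element noise described in the abstract is intended to protect.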

Full Text