A DP Canopy K-Means Algorithm for Privacy Preservation of Hadoop Platform

Tao Shang, Jianwei Liu, Zhenyu Guan, Zheng Zhao

doi:10.1007/978-3-319-69471-9_14

Abstract

The K-means algorithm for data mining has been combined with differential privacy (DP) preservation. Although this improves the security of the data, the number of clusters and the initial center points are still selected blindly and at random. In this paper, we integrate an optimized Canopy algorithm with the DP K-means algorithm and apply it to the Hadoop platform. First, we optimize the Canopy algorithm according to the minimum and maximum principle and implement it with the functions of the MapReduce framework. Second, we use the resulting number and set of center points to implement the DP K-means algorithm on MapReduce. As a result, the improved Canopy algorithm optimizes the selection of the number of clusters and their centers on the Hadoop platform, so the proposed K-means algorithm can improve the security, usability, and efficiency of the computation.
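The two stages described above can be illustrated with a minimal single-machine sketch. It is not the authors' MapReduce implementation: the function names (`maxmin_centers`, `dp_kmeans`, `laplace`), the distance `threshold`, the even split of the privacy budget across iterations, and the assumption that the data is scaled to [0, 1]^d (so that the sum and count queries have L1 sensitivity d + 1) are all illustrative choices, not details taken from the paper. Noise is added only in the K-means update step.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def maxmin_centers(points, threshold):
    """Choose initial centers by a max-min rule (hypothetical helper).

    Repeatedly add the point whose distance to its nearest chosen center
    is largest, and stop once that distance falls below `threshold`.
    The number of centers found is then used as k.
    """
    centers = [points[0]]
    while len(centers) < len(points):
        far_point, far_dist = max(
            ((p, min(dist(p, c) for c in centers)) for p in points),
            key=lambda pair: pair[1],
        )
        if far_dist < threshold:
            break
        centers.append(far_point)
    return centers


def laplace(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def dp_kmeans(points, centers, epsilon, iterations, rng):
    """K-means with Laplace noise on per-cluster sums and counts.

    Assumes every coordinate lies in [0, 1] and splits `epsilon`
    evenly over the iterations (both are assumptions of this sketch).
    """
    dim = len(points[0])
    scale = (dim + 1) / (epsilon / iterations)
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centers]
        counts = [0] * len(centers)
        # Assignment step: each point goes to its nearest center.
        for p in points:
            j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        # Update step: perturb sums and counts, then clamp to the data range.
        new_centers = []
        for j in range(len(centers)):
            noisy_count = max(counts[j] + laplace(scale, rng), 1.0)
            new_centers.append(tuple(
                min(1.0, max(0.0, (sums[j][d] + laplace(scale, rng)) / noisy_count))
                for d in range(dim)
            ))
        centers = new_centers
    return centers


points = [(0.1, 0.1), (0.12, 0.1), (0.1, 0.13),
          (0.9, 0.9), (0.88, 0.9), (0.9, 0.93)]
initial = maxmin_centers(points, threshold=0.5)
final = dp_kmeans(points, initial, epsilon=1.0, iterations=5,
                  rng=random.Random(0))
```

In this sketch the max-min stage fixes both the number of clusters and the starting centers, so the noisy K-means stage no longer depends on a random initialization. On a dataset this small, the Laplace noise at `epsilon=1.0` dominates the result; the privacy and accuracy trade-off only becomes meaningful with many points per cluster.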

Full Text