A novel Map-Scan-Reduce based density peaks clustering and privacy protection approach for large datasets

Satyajaswanth Badri

doi:10.1080/1206212x.2019.1624314

Abstract

Clustering large datasets is a major research issue because of the long processing time involved. Clustering methods aim to split a large dataset into a number of groups, each consisting of similar data points. Numerous clustering methods have been proposed, including partition-based, hierarchical, density-based, spectral, and subspace clustering. These methods fail to produce accurate clusters within a short response time. To mitigate the issues of existing methods, in this paper we propose a novel Map-Scan-Reduce based density peaks (DP) clustering approach for large datasets. MapReduce is a popular distributed processing framework with several advantages: it can handle the issues that arise with large data volumes, and it partitions data in a distributed way. Native MapReduce, however, has certain drawbacks, including high communication and computation overhead. In this paper, the Map-Scan-Reduce process is divided into three steps, MAP, SCAN, and REDUCE, which address the drawbacks of native MapReduce. User scheduling is implemented using an adaptive neuro-fuzzy scheduler, and data preprocessing is implemented using an improved version of M-Z-D-S (Max–Min, Z-score, and decimal scaling). Furthermore, cluster privacy is protected using a differential privacy method, and cluster quality is validated using two metrics, Silhouette and Dunn, for inter-cluster and intra-cluster validation, respectively. Finally, we conduct experiments to analyze the performance of the proposed work in terms of clustering accuracy, speedup ratio (execution time), and efficiency (precision, adjusted Rand index, and normalized mutual information).
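
For orientation, the three normalizations named in M-Z-D-S can be sketched in their standard textbook forms. This is only an illustrative sketch: the abstract does not describe how the improved version differs, so the code below is not the paper's method, and the function names `min_max`, `z_score`, and `decimal_scaling` are hypothetical.

```python
import statistics


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant feature: no spread
    return [(v - mean) / std for v in values]


def decimal_scaling(values):
    """Divide by the smallest power of ten that brings every |value| below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


if __name__ == "__main__":
    feature = [-500, 250, 75]  # made-up sample values, not data from the paper
    print(min_max(feature))
    print(z_score(feature))
    print(decimal_scaling(feature))
```

Each function works on a single feature column; in a distributed setting the minimum, maximum, mean, and standard deviation would have to be aggregated across partitions before the per-value rescaling is applied.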

Full Text