MapReduce-Based Distributed Clustering Method Using CF+ Tree

Hyeong-Cheol Ryu, Sungwon Jung

doi:10.1109/access.2020.2999085

Open Access

https://doi.org/10.1109/access.2020.2999085

Abstract

Clustering exceptionally large data sets is becoming a major challenge in data analytics as their size continues to increase. Summary-based clustering methods and distributed computing frameworks such as MapReduce can handle this challenge efficiently. Summary-based methods include BIRCH and its extension CF+-ERC. CF+-ERC reduces the clustering time of large data sets by utilizing the structure of a CF+ tree. However, CF+-ERC is a sequential clustering method, so it cannot use multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF+-ERC on MapReduce (CF+ERC_MR). It builds a CF+ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through both theoretical analysis and in-depth experimental analysis of exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of the existing clustering methods.

Highlights

Owing to the rapid advancement of the Internet and online technologies, exceptionally large collections of data containing digital traces from users and devices can be generated
We propose a novel distributed clustering method, CF+ERC_MR, which is an extension of CF+-ERC based on MapReduce
We propose a refining CF+ tree consisting of the border microclusters where each border microcluster maintains both the reduce task index and the local final cluster index

Summary

INTRODUCTION

Owing to the rapid advancement of the Internet and online technologies, exceptionally large collections of data containing digital traces from users and devices can be generated. A set of microclusters connected by a dotted line indicates a local final cluster determined by ERC during the reduce task. The reduce task receives the ⟨i, M⟩ pair generated during the shuffle phase, the threshold value T, and a set V of region centroids. It builds a CF+ tree, called tree, from the microclusters of M to recheck the validity of the threshold requirement. The reduce task Ri then performs ERC(tree, T) to find the local final cluster set Li by utilizing the structure of tree in a parallel manner.
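The reduce step described above can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the CF+ tree and ERC are replaced here by a brute-force pairwise comparison with union-find, the rule "two microclusters belong to the same local final cluster when their centroids lie within T" is an assumption made for illustration, and the region centroid set V is omitted. The names MicroCluster and reduce_task are hypothetical. The only part taken from standard BIRCH is the clustering feature itself (point count, linear sum, sum of squares) and its additivity.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass
class MicroCluster:
    """BIRCH-style clustering feature (CF): count, linear sum, sum of squared norms."""
    n: int
    ls: tuple   # per-dimension linear sum of the summarized points
    ss: float   # sum of squared norms of the summarized points

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merged(self, other):
        # CFs are additive: merging two summaries is a component-wise sum.
        return MicroCluster(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.ls, other.ls)),
            self.ss + other.ss,
        )

    def centroid(self):
        return tuple(x / self.n for x in self.ls)


def reduce_task(i, M, T):
    """Hypothetical stand-in for reduce task Ri.

    Takes the task index i, the list M of microclusters from the shuffle
    phase, and the threshold T. Returns (i, Li), where Li is a list of sets
    of indices into M, one set per local final cluster.
    """
    parent = list(range(len(M)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Assumed connection rule: link microclusters whose centroids are within T.
    # A CF+ tree would avoid comparing every pair; this sketch does not.
    for a, b in combinations(range(len(M)), 2):
        if math.dist(M[a].centroid(), M[b].centroid()) <= T:
            parent[find(a)] = find(b)

    groups = {}
    for idx in range(len(M)):
        groups.setdefault(find(idx), set()).add(idx)
    return i, sorted(groups.values(), key=min)
```

The pairwise loop is quadratic in the number of microclusters; the tree structure exploited by ERC is what the paper relies on to avoid this cost, and that part is not reproduced here.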

9: Store Bi and Ii to DFS
10: return Li

16: Add E to G

THEORETICAL ANALYSIS

CONCLUSIONS