Abstract

Clustering exceptionally large data sets is becoming a major challenge in data analytics as their size continues to increase. Summary-based clustering methods and distributed computing frameworks such as MapReduce can handle this challenge efficiently. Summary-based methods include BIRCH and its extension CF+-ERC. CF+-ERC reduces the clustering time of large data sets by utilizing the structure of a CF+ tree. However, CF+-ERC is a sequential clustering method, so it cannot use multiple machines to reduce the clustering time. In this study, we propose a novel MapReduce-based distributed clustering method called CF+-ERC on MapReduce (CF+ERC_MR). It builds a CF+ tree for clustering an exceptionally large data set with a given threshold and finds the final clusters using MapReduce, which significantly reduces the clustering time. Further, our method is scalable with respect to the number of machines. The efficacy of this method is validated through both theoretical analysis and in-depth experimental analysis on exceptionally large synthetic and real data sets. The experimental results demonstrate that the clustering speed of our approach is far superior to that of existing clustering methods.

Highlights

  • Owing to the rapid advancement of the Internet and online technologies, exceptionally large collections of data containing digital traces from users and devices can be generated

  • We propose a novel distributed clustering method, CF+ERC_MR (CF stands for clustering feature), which is an extension of CF+-ERC based on MapReduce

  • We propose a refining CF+ tree consisting of the border microclusters where each border microcluster maintains both the reduce task index and the local final cluster index
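The clustering feature and the border microcluster record mentioned in the highlights can be pictured with a minimal sketch. It assumes the standard BIRCH-style clustering feature (count, linear sum, sum of squared norms); the class names `CF` and `BorderMicrocluster` and the field names `reduce_task_index` and `local_cluster_index` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
import math


@dataclass
class CF:
    """BIRCH-style clustering feature: point count, linear sum, sum of squared norms."""
    n: int
    ls: tuple
    ss: float

    @classmethod
    def from_point(cls, p):
        # A single point is a CF with count 1.
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        # CFs are additive: merging two microclusters adds their components.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(x / self.n for x in self.ls)

    def radius(self):
        # Root mean squared distance of the summarized points from the centroid.
        c = self.centroid()
        var = self.ss / self.n - sum(x * x for x in c)
        return math.sqrt(max(var, 0.0))


@dataclass
class BorderMicrocluster:
    """Hypothetical record for a border microcluster in the refining CF+ tree."""
    cf: CF
    reduce_task_index: int    # which reduce task produced it (assumed name)
    local_cluster_index: int  # its local final cluster in that task (assumed name)


merged = CF.from_point((0.0, 0.0)).merge(CF.from_point((2.0, 0.0)))
border = BorderMicrocluster(cf=merged, reduce_task_index=3, local_cluster_index=0)
```

Because the summaries are additive, microclusters can be merged without revisiting the raw points, which is what makes summary-based methods suitable for very large data sets.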


Summary

INTRODUCTION

Owing to the rapid advancement of the Internet and online technologies, exceptionally large collections of data containing digital traces from users and devices can be generated.

A set of microclusters connected by a dotted line indicates a local final cluster determined by ERC during the reduce task. The reduce task receives the (i, M) pair generated during the shuffle phase, the threshold value T, and a set V of region centroids. It builds a CF+ tree, called tree, from the microclusters of M to recheck the validity of the threshold requirement. The reduce task Ri then performs ERC(tree, T) to find the local final cluster set Li by utilizing the structure of tree in a parallel manner.
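The flow of the shuffle phase and the reduce task can be pictured with a simplified sketch. It is not the paper's algorithm: microclusters are reduced to their centroids, the assignment of a microcluster to the nearest region centroid in V is an assumption, and ERC is replaced by a plain stand-in that links microclusters whose centroids lie within T of each other and returns the connected components. The pairwise loop also replaces the CF+ tree search that the actual method relies on. The function names `assign_region`, `shuffle`, and `reduce_task` are hypothetical.

```python
from collections import defaultdict
import math


def assign_region(centroid, region_centroids):
    """Assumed rule: index i of the region centroid in V nearest to the microcluster."""
    return min(range(len(region_centroids)),
               key=lambda i: math.dist(centroid, region_centroids[i]))


def shuffle(microclusters, region_centroids):
    """Group microcluster centroids by region index, giving one (i, M) pair per region."""
    groups = defaultdict(list)
    for c in microclusters:
        groups[assign_region(c, region_centroids)].append(c)
    return dict(groups)


def reduce_task(M, T):
    """Stand-in for ERC (assumption): connected components of the 'within T' graph."""
    parent = list(range(len(M)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Brute-force pairwise check; the real method uses the CF+ tree instead.
    for a in range(len(M)):
        for b in range(a + 1, len(M)):
            if math.dist(M[a], M[b]) <= T:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for idx, c in enumerate(M):
        clusters[find(idx)].append(c)
    return sorted(sorted(members) for members in clusters.values())


V = [(0.0, 0.0), (10.0, 10.0)]
microclusters = [(0.0, 0.0), (0.5, 0.0), (4.0, 4.0), (10.0, 10.0), (10.4, 10.0)]
pairs = shuffle(microclusters, V)
local_clusters = {i: reduce_task(M, T=1.0) for i, M in pairs.items()}
```

Each region is processed independently here, which mirrors how separate reduce tasks could run on different machines; reconciling clusters that span region borders is outside this sketch.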

9: Store Bi and Ii to DFS
10: return Li
16: Add E to G
THEORETICAL ANALYSIS
CONCLUSIONS