Multidimensional Canopy Clustering on Iterative MapReduce Framework Using Elefig Tool

Poonam Ghuli, Archit Shukla, Raj Kiran, Sheraaz Jason, Rajashree Shettar

doi:10.1080/03772063.2014.988760

Abstract

A number of applications today deal with the processing and analysis of Big Data. As the size of the data increases, it becomes important to process it to reveal new and interesting patterns. One such processing task is to group records into logical clusters. Most clustering algorithms are iterative in nature. Hence, they perform exceptionally well when modelled on an iterative distributed framework such as Twister. Here, a canopy clustering algorithm was modelled as a series of MapReduce jobs. Once the overlapping canopies are generated, k-means clustering is applied to form the actual clusters. A comparative study was performed on variants of the MapReduce framework such as Twister and Hadoop. Experimental results show that, even for a large number of data points, the implementation of canopy clustering on Twister was more than three times as fast as its implementation on Hadoop. The speedup of canopy clustering using Twister was considerable: more than 24 times faster than the implementation of a k-means algorithm on Hadoop. In addition, a new tool called Elefig was designed to let the master node automatically find the location of slave nodes in a Hadoop cluster. Without the Elefig tool, one has to fix the problem manually by updating the hosts file on each node of the cluster whenever the cluster is booted up.
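To make the canopy step concrete, the following is a minimal single-process sketch in Python of how overlapping canopies can be formed before k-means is applied. It is not the authors' MapReduce or Twister implementation. The function name `canopy_clustering`, the loose and tight thresholds `t1` and `t2`, the Euclidean distance, and the random choice of centers are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clustering(points, t1, t2, distance=euclidean, seed=0):
    """Group points into possibly overlapping canopies.

    t1 is the loose threshold and t2 the tight threshold (t1 > t2).
    Both names and the whole interface are hypothetical, chosen for
    this sketch only.
    Returns a list of (center, members) pairs.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")

    rng = random.Random(seed)
    remaining = list(points)
    canopies = []

    while remaining:
        # Pick an arbitrary remaining point as the next canopy center.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []

        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Within the loose threshold: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Outside the tight threshold: the point may still seed
                # or join another canopy, which is what allows overlap.
                still_remaining.append(p)

        remaining = still_remaining
        canopies.append((center, members))

    return canopies


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
            (10.0, 10.0), (10.5, 10.0), (10.0, 10.5)]
    for center, members in canopy_clustering(data, t1=3.0, t2=1.0):
        print(center, len(members))
```

In a pipeline like the one the abstract describes, the canopy centers could then serve as initial centroids for k-means, with distance computations restricted to points that share a canopy. How the work is split across map and reduce tasks is not shown here.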

Full Text