Data Categorization Using Hadoop MapReduce-Based Parallel K-Means Clustering

Zahid Ansari, Asif Afzal, Tanvir Habib Sardar

doi:10.1007/s40031-019-00388-x

Abstract

The volume of datasets is increasing at a very fast rate owing to the expanding digitalization of every field of work. Traditional clustering algorithms become ineffective for analyzing such huge datasets because they take a long time to cluster them. Parallel and distributed architectures are designed to process such large datasets. To cluster efficiently, traditional clustering algorithms need to be redesigned for these parallel and distributed architectures. A few parallel clustering algorithms have been designed for efficiency, but they work on datasets that are loaded into and accessed from main memory, which limits them when a large dataset's millions of data objects cannot be loaded into memory at once. In this work, we propose a parallel version of traditional K-means that executes on the Hadoop distributed framework. The experimental results show that the proposed K-means algorithm outperforms traditional K-means when clustering large volumes of data.

Full Text