Abstract

Conventional clustering algorithms cannot cope with the rapid growth of data generated from different sources, which makes such data difficult to manage and analyze. Parallel clustering is one robust solution to this problem. The Apache Hadoop architecture is one of several ecosystems that provide the capability to store and process data in a distributed and parallel fashion. In this paper, a parallel model is designed to run the k-means clustering algorithm on the Apache Hadoop ecosystem by connecting three nodes: one server (name) node and two client (data) nodes. The aim is to speed up the processing of a massive healthcare insurance dataset of 11 GB using the machine learning algorithms provided by the Mahout framework. The experimental results show that the proposed model can efficiently process large datasets. The parallel k-means algorithm outperforms the sequential k-means algorithm in execution time: clustering the 11 GB dataset takes about 1.847 hours with the parallel k-means algorithm versus 68.567 hours with the sequential k-means algorithm. From this we deduce that as the number of nodes in the parallel system increases, the computation time of the proposed algorithm decreases.
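For context, the reported run times imply the following overall speedup for the parallel configuration (a back-of-the-envelope calculation based on the figures above):

$$\text{speedup} = \frac{T_{\text{sequential}}}{T_{\text{parallel}}} = \frac{68.567\ \text{h}}{1.847\ \text{h}} \approx 37.1$$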

Highlights

  • Big data is a combination of high-volume, substantial, and heterogeneously formatted data created from varied and separate data sources

  • The connection between the master (name) and slave (data) nodes is established; before processing in Mahout, the Comma-Separated Values (CSV) data must be uploaded to the Hadoop Distributed File System (HDFS) and converted to vectors (see the sketch after this list)

  • The experiments vary the initial number of clusters and compare the run time of the parallel and sequential k-means clustering algorithms
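The CSV-to-vectors step mentioned above is not detailed here. Below is a minimal Java sketch of one way to perform it, assuming the Hadoop FileSystem API and Mahout's math types; the HDFS paths are hypothetical and the rows are assumed to be purely numeric (a real pipeline would also skip the header and encode categorical fields):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path csv = new Path("/data/insurance.csv");                     // hypothetical HDFS input path
    Path vectors = new Path("/data/insurance-vectors/part-00000");  // hypothetical output path

    try (BufferedReader reader =
             new BufferedReader(new InputStreamReader(fs.open(csv)));
         SequenceFile.Writer writer = SequenceFile.createWriter(conf,
             SequenceFile.Writer.file(vectors),
             SequenceFile.Writer.keyClass(Text.class),
             SequenceFile.Writer.valueClass(VectorWritable.class))) {

      String line;
      long row = 0;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
          values[i] = Double.parseDouble(fields[i].trim());
        }
        // Each CSV row becomes one Mahout vector keyed by its row index.
        writer.append(new Text("row-" + row++),
                      new VectorWritable(new DenseVector(values)));
      }
    }
  }
}
```

Mahout's k-means driver can then read the resulting SequenceFile of VectorWritable records directly from HDFS.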



Introduction

Big data is a combination of high-volume, substantial, and heterogeneously formatted data created from varied and separate data sources. Researchers and scientists consider big data one of the most important subjects in computer science today [1]. Social media sites, hospital records, and several other new sources are behind the phenomenon of big data [2]. A data warehouse cannot handle such a dataset as a whole because of its vast size [3]. Conventional algorithms cannot cope with these enormous amounts of data, so they are not efficient for analyzing them [4]. The traditional k-means clustering algorithm [5, 6] is not sufficient to process this massive amount of data; Hadoop and MapReduce tools can be used to deal with it [7].
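To make the idea concrete, the sketch below shows how a single k-means iteration can be expressed as a Hadoop MapReduce job: the mapper assigns each record to its nearest centroid and the reducer recomputes each centroid as the mean of its assigned records. This is only an illustration under assumed input (comma-separated numeric records) and a hypothetical "kmeans.centroids" configuration property; the paper itself relies on Mahout's built-in parallel k-means rather than hand-written jobs.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** One k-means iteration over comma-separated numeric records, expressed as a MapReduce job. */
public class KMeansIteration {

  /** Assigns each record to the nearest centroid; centroids arrive via the job configuration. */
  public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context context) {
      // Centroids are encoded as "x,y;x,y;..." in the (hypothetical) "kmeans.centroids" property.
      String[] encoded = context.getConfiguration().get("kmeans.centroids").split(";");
      centroids = new double[encoded.length][];
      for (int i = 0; i < encoded.length; i++) {
        String[] parts = encoded[i].split(",");
        centroids[i] = new double[parts.length];
        for (int d = 0; d < parts.length; d++) centroids[i][d] = Double.parseDouble(parts[d]);
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",");
      double[] point = new double[parts.length];
      for (int d = 0; d < parts.length; d++) point[d] = Double.parseDouble(parts[d]);
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {        // squared Euclidean distance
        double dist = 0;
        for (int d = 0; d < point.length; d++) {
          double diff = point[d] - centroids[i][d];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = i; }
      }
      context.write(new IntWritable(best), value);        // cluster id -> original record
    }
  }

  /** Recomputes each centroid as the mean of the records assigned to it. */
  public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      double[] sum = null;
      long count = 0;
      for (Text record : records) {
        String[] parts = record.toString().split(",");
        if (sum == null) sum = new double[parts.length];
        for (int d = 0; d < parts.length; d++) sum[d] += Double.parseDouble(parts[d]);
        count++;
      }
      StringBuilder centroid = new StringBuilder();
      for (int d = 0; d < sum.length; d++) {
        centroid.append(d > 0 ? "," : "").append(sum[d] / count);
      }
      context.write(clusterId, new Text(centroid.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("kmeans.centroids", args[2]);                // e.g. "1.0,2.0;5.0,6.0" for k = 2
    Job job = Job.getInstance(conf, "kmeans-iteration");
    job.setJarByClass(KMeansIteration.class);
    job.setMapperClass(AssignMapper.class);
    job.setReducerClass(RecomputeReducer.class);
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS directory of CSV records
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS directory for new centroids
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In practice the job would be run repeatedly, feeding each iteration's reducer output back in as the next iteration's centroids until they stop moving, which is essentially what Mahout's parallel k-means driver automates.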
