Optimized K-Means Clustering Model based on Gap Statistic

Amira M El-Mandouh, Hamdi A, Laila A, Mohamed H

doi:10.14569/ijacsa.2019.0100124

Open Access

Abstract

Big data technologies have become the standard means of processing, storing, and managing massive volumes of data. Clustering is an essential phase of big data analysis, and many real-life application areas rely on clustering to analyze results. Clustering large datasets has become a challenging issue in the field of big data analytics. Among clustering algorithms, K-means has historically been the most widely used unsupervised approach. It is best suited to deciding similarities between objects based on distance measures on small datasets, whereas large datasets require scalable solutions. Moreover, for a particular domain-specific problem, the initial selection of K remains a significant concern. This paper presents an optimized clustering approach that calculates the optimal number of clusters (k) for specific domain problems. The proposed approach selects k by analyzing a cluster performance measure based on the gap statistic. The experimental results show that the proposed model can improve the speed and accuracy of the clustering process by reducing the computational complexity of the standard k-means algorithm, achieving 76.3%.
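To make the role of the gap statistic concrete, the following is a minimal sketch of how it can be used to pick k. It is not the authors' implementation. It follows the standard definition: Gap(k) is the mean of log W_k over reference datasets drawn uniformly from the bounding box of the data, minus log W_k of the data itself, where W_k is the within-cluster sum of squares. The function names (seed_centers, within_ss, gap_statistic, choose_k), the farthest-point seeding, and all parameter defaults are assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_centers(points, k, rng):
    """Assumed seeding: one random point, then repeatedly the point farthest from those chosen."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def within_ss(points, k, rng, iters=50):
    """Run Lloyd's k-means once and return the within-cluster sum of squares W_k."""
    centers = seed_centers(points, k, rng)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))


def gap_statistic(points, k_max, n_refs=10, seed=0):
    """Return (gaps, s) for k = 1..k_max. Assumes k_max is well below the number of distinct points."""
    rng = random.Random(seed)
    lows = [min(d) for d in zip(*points)]
    highs = [max(d) for d in zip(*points)]
    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref_logs = []
        for _ in range(n_refs):
            # Reference data: uniform samples over the bounding box of the real data.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs)) for _ in points]
            ref_logs.append(math.log(within_ss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        gaps.append(mean_ref - math.log(within_ss(points, k, rng)))
        s.append(sd * math.sqrt(1 + 1 / n_refs))
    return gaps, s


def choose_k(gaps, s):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); falls back to the largest k tried."""
    for i in range(len(gaps) - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1
    return len(gaps)
```

A call such as gaps, s = gap_statistic(points, k_max=6) followed by choose_k(gaps, s) returns a candidate k. On well-separated groups the gap curve typically peaks at the true number of groups, but the result depends on the reference sampling and on the quality of each k-means run, so it should be treated as an estimate.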

Highlights

Cluster analysis is a vital exploratory mechanism widely applied in many fields such as biology, sociology, medicine, and business
In K-means clustering, where Euclidean distance measures similarity, the k data objects farthest from each other are more representative than k randomly selected data objects [5][6]
MapReduce is an important programming paradigm for processing and generating big datasets with a parallel, distributed algorithm [15]
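
As an illustration of how a K-means iteration fits the MapReduce paradigm mentioned above, here is a small single-process simulation. It is a sketch under stated assumptions, not the paper's code and not a Hadoop API: the names map_phase, combine, reduce_phase, and kmeans_iteration are hypothetical, and the shuffle step is imitated with an in-memory dictionary.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(split, centers):
    """Mapper: emit (nearest-center index, (point, 1)) for every point in one input split."""
    for p in split:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        yield nearest, (p, 1)


def combine(pairs):
    """Combiner: pre-aggregate one mapper's output into per-cluster (vector sum, count)."""
    partial = {}
    for key, (vec, count) in pairs:
        if key in partial:
            total, n = partial[key]
            partial[key] = (tuple(a + b for a, b in zip(total, vec)), n + count)
        else:
            partial[key] = (vec, count)
    return partial.items()


def reduce_phase(key, values):
    """Reducer: merge the partial sums for one cluster and return its new center."""
    total, n = None, 0
    for vec, count in values:
        total = vec if total is None else tuple(a + b for a, b in zip(total, vec))
        n += count
    return key, tuple(x / n for x in total)


def kmeans_iteration(splits, centers):
    """Simulate one MapReduce job: map and combine per split, shuffle by key, then reduce."""
    shuffled = defaultdict(list)
    for split in splits:
        for key, value in combine(map_phase(split, centers)):
            shuffled[key].append(value)
    new_centers = list(centers)  # clusters that received no points keep their old center
    for key, values in shuffled.items():
        _, new_centers[key] = reduce_phase(key, values)
    return new_centers
```

Each split stands in for the block of data handled by one mapper. Because the combiner forwards only a vector sum and a count per cluster, the amount of data shuffled to the reducers does not grow with the number of points in a split. A driver would call kmeans_iteration repeatedly until the centers stop moving.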

Summary

INTRODUCTION

Cluster analysis is a vital exploratory mechanism widely applied in many fields such as biology, sociology, medicine, and business. K-means, proposed by MacQueen, is an unsupervised, distance-based learning algorithm [3] and the best-known algorithm for cluster analysis. In K-means clustering, where Euclidean distance measures similarity, the k data objects farthest from each other are more representative than k randomly selected data objects [5][6]. Clustering organizes the specified objects into a group of classes called clusters, calculating similarities among objects according to specific criteria. The proposed method does not need to calculate the distance from each data point to every cluster center in each iteration, which reduces the running time of the algorithm.
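
One well-known way to avoid computing every point-to-center distance in every iteration is to cache each point's distance to its assigned center and to re-check the other centers only when that distance grows after the centers move. The sketch below illustrates this heuristic. Whether it matches the authors' method is an assumption, the name kmeans_skip and its parameters are hypothetical, and the shortcut is an approximation, so its result can differ from that of standard K-means.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def kmeans_skip(points, centers, max_iters=100):
    """k-means variant that re-checks other centers only for points that drifted away.

    Returns (centers, labels, evaluations), where evaluations counts sq_dist calls.
    """
    k = len(centers)
    evaluations = 0
    labels, cached = [], []
    # First pass: a full assignment, caching each point's distance to its center.
    for p in points:
        dists = [sq_dist(p, c) for c in centers]
        evaluations += k
        best = min(range(k), key=dists.__getitem__)
        labels.append(best)
        cached.append(dists[best])
    for _ in range(max_iters):
        # Update step: move each center to the mean of its members (empty clusters stay put).
        centers = [
            mean([p for p, lab in zip(points, labels) if lab == j]) if j in labels else centers[j]
            for j in range(k)
        ]
        changed = False
        for i, p in enumerate(points):
            own = sq_dist(p, centers[labels[i]])
            evaluations += 1
            if own <= cached[i]:
                cached[i] = own  # at least as close as before: skip the other centers
                continue
            # The point moved away from its center, so compare against every other center.
            dists = [own if j == labels[i] else sq_dist(p, centers[j]) for j in range(k)]
            evaluations += k - 1
            best = min(range(k), key=dists.__getitem__)
            changed = changed or best != labels[i]
            labels[i], cached[i] = best, dists[best]
        if not changed:
            break
    return centers, labels, evaluations
```

The returned evaluation count can be compared with n * k per pass for standard K-means. The saving grows with k, since a point that stays close to its center costs one distance computation instead of k.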

RELATED WORK

MapReduce Model

Gap Statistics

OPTIMIZED CLUSTERING APPROACH

Optimized K-Means Clustering Approach

Merging and Optimization

Experiments Evaluation Metrics

Dataset

Results

CONCLUSIONS

Journal: International Journal of Advanced Computer Science and Applications
Publication Date: Jan 1, 2019
Citations: 11
License type: cc-by
