Clustering large datasets using K-means modified inter and intra clustering (KM-I2C) in Hadoop

Chowdam Sreedhar, Pakanti Chenna Reddy, Nagulapally Kasiviswanath

doi:10.1186/s40537-017-0087-2

Open Access

https://doi.org/10.1186/s40537-017-0087-2

Journal: Journal of Big Data
Publication Date: Sep 5, 2017
Citations: 41
License type: open-access

Affiliation: G Pulla Reddy Dental College & Hospital

Abstract

Big data has become popular for processing, storing and managing massive volumes of data. The clustering of datasets has become a challenging issue in the field of big data analytics. The K-means algorithm is best suited for finding similarities between entities based on distance measures when datasets are small. Existing clustering algorithms require scalable solutions to manage large datasets. This study presents two approaches to the clustering of large datasets using MapReduce. The first approach, K-Means Hadoop MapReduce (KM-HMR), focuses on the MapReduce implementation of standard K-means. The second approach, K-means modified inter and intra clustering (KM-I2C), enhances the quality of clusters to produce clusters with minimum intra-cluster and maximum inter-cluster distances for large datasets. The results of the proposed approaches show significant improvements in the efficiency of clustering in terms of execution times. Experiments conducted on standard K-means and the proposed solutions show that the KM-I2C approach is both effective and efficient.
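The MapReduce formulation of standard K-means mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' Hadoop code: it simulates the map and reduce phases in plain Python on a single machine, and the function names (`map_phase`, `reduce_phase`, `kmeans_mapreduce`) and the stopping rule are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def map_phase(points, centroids):
    """Map step: emit one (centroid_index, point) pair per input point."""
    for p in points:
        yield nearest(p, centroids), p


def reduce_phase(pairs, centroids):
    """Reduce step: average the points grouped under each centroid index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        dim = len(members[0])
        new_centroids[idx] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return new_centroids


def kmeans_mapreduce(points, centroids, max_iter=20):
    """Repeat map and reduce until the centroids stop changing (assumed stopping rule)."""
    centroids = list(centroids)
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centroids), centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
result = kmeans_mapreduce(points, [(0, 0), (10, 10)])
print(result)
```

In a real Hadoop job the map step would run in parallel over input splits and the grouping by centroid index would be done by the shuffle phase; here a dictionary stands in for that grouping.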

Highlights

In recent years, datasets generated by machines have been large in terms of volume and have been globally distributed [1].
Clustering is a challenging issue that is heavily shaped by the data used and the problems considered.
The standard K-means method is the most popular clustering method due to its simplicity and reasonable execution efficiency when applied to small datasets.

Summary

Introduction

Datasets generated by machines have been large in terms of volume and have been globally distributed [1]. There is a need to manage such large volumes of data and to cluster them for data analytics while minimizing maximum inter-cluster distances. Algorithms for this task should be efficient, scalable and highly accurate. The K-means clustering algorithm is a popular unsupervised clustering technique that identifies similarities between objects based on distance vectors and is suited to small datasets. Datasets generated by sources such as Wikipedia, meteorological departments, telecommunications systems, and sensors are so large that traditional K-means clustering algorithms are no longer able to group related objects to develop meaningful insights. We apply a Hadoop MapReduce implementation of the standard K-means clustering algorithm to manage large datasets and introduce a new metric for similarity measurements such that the resulting clusters exhibit high levels of intra-cluster similarity and low levels of inter-cluster similarity.
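The goal of high intra-cluster similarity and low inter-cluster similarity can be made concrete with a small sketch. The two measures below are generic textbook definitions (mean distance to the centroid, and smallest distance between centroids), assumed here for illustration; they are not the similarity metric proposed in the paper, which this summary does not specify.

```python
import math
from itertools import combinations


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    dim = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dim))


def intra_cluster_distance(cluster):
    """Mean Euclidean distance from each point to its cluster centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) for p in cluster) / len(cluster)


def inter_cluster_distance(clusters):
    """Smallest Euclidean distance between any two cluster centroids."""
    cents = [centroid(c) for c in clusters]
    return min(math.dist(a, b) for a, b in combinations(cents, 2))


clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
intra = [intra_cluster_distance(c) for c in clusters]
inter = inter_cluster_distance(clusters)
print(intra, inter)
```

Under these definitions, a clustering is better when the intra-cluster values are small (members sit close together) and the inter-cluster value is large (clusters are well separated).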

Background

Conclusions


