Abstract

Big data has become popular for processing, storing and managing massive volumes of data. The clustering of datasets has become a challenging issue in the field of big data analytics. The K-means algorithm is best suited for finding similarities between entities based on distance measures with small datasets. Existing clustering algorithms require scalable solutions to manage large datasets. This study presents two approaches to the clustering of large datasets using MapReduce. The first approach, K-Means Hadoop MapReduce (KM-HMR), focuses on the MapReduce implementation of standard K-means. The second approach enhances the quality of clusters to produce clusters with maximum intra-cluster and minimum inter-cluster distances for large datasets. The results of the proposed approaches show significant improvements in the efficiency of clustering in terms of execution times. Experiments conducted on standard K-means and proposed solutions show that the KM-I2C approach is both effective and efficient.

Highlights

  • In the recent years, datasets generated by machines have been large in terms of volume and have been globally distributed [1]

  • Clustering is a challenging issue that is heavily shaped by data used and problems considered

  • The standard K-means method is the most popular clustering method due to its simplicity and reasonable execution efficiency when applied to small datasets

Read more

Summary

Introduction

Datasets generated by machines have been large in terms of volume and have been globally distributed [1]. There is a need to manage such large volumes of data and to cluster them for data analytics while minimizing maximum inter-cluster distances and managing large datasets Such algorithms should be efficient, scalable and highly accurate. The K-means clustering algorithm is a popular unsupervised clustering technique used to identify similarities between objects based on distance vectors suited to small datasets. Datasets generated by sources such as Wikipedia, meteorological departments, telecommunications systems, and sensors are so large that traditional K-means clustering algorithms are no longer able to group related objects to develop meaningful insights. We apply the Hadoop MapReduce standard K-means clustering algorithm to manage large datasets and introduce a new metric for similarity measurements such that the distances between objects exhibit high levels of intra-cluster similarity and low levels of inter-cluster similarity.

Background
Conclusions
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.