Maxmin Data Range Heuristic-Based Initial Centroid Method of Partitional Clustering for Big Data Mining

Kamlesh Kumar Pandey, Diwakar Shukla

doi:10.4018/ijirr.289954


Open Access

https://doi.org/10.4018/ijirr.289954


Abstract

Centroid-based clustering algorithms depend on the number of clusters, the initial centroids, the distance measure, and the statistical measure of central tendency. The centroid initialization algorithm determines convergence speed, computing efficiency, execution time, scalability, memory utilization, and performance for big data clustering. Researchers have proposed various cluster initialization techniques; some reduce the number of iterations at the cost of lower cluster quality, while others improve cluster quality at the cost of more iterations. For these reasons, this study proposes the Maxmin Data Range Heuristic (MDRH) initial centroid method for K-Means (KM) clustering, which reduces execution time and the number of iterations and improves cluster quality for big data clustering. The proposed MDRH method is compared against the classical KM and KM++ algorithms on four real datasets. The MDRH method achieves better effectiveness and efficiency across the RS, DB, CH, SC, IS, and CT quantitative measures.

Highlights

The rapid development of digital technologies has produced enormous amounts of data in different formats at high speed, for example through social media
This paper summarizes the value, veracity, variability, and visualization characteristics of big data as “Veracity validates the accuracy basis of variety, the value identifies predicted value based on volume and variety, variability presents specific analysis tools based on the volume and variety, and visualization visualized the results and problems based on the volume, variety, and velocity.”
Efficiency- and effectiveness-related results are shown in Tables 3 and 4; the reported value of each evaluation measure is the average over ten trials

Summary

INTRODUCTION

The rapid development of digital technologies has produced enormous amounts of data in different formats at high speed, for example through social media. The pros and cons of the initial centroid methods for big data clustering are shown in Table 1, drawn from the discussed literature and comparative analyses (Celebi et al., 2013; Fränti & Sieranoja, 2019; He et al., 2004; Peña et al., 1999; Steinley & Brusco, 2007) and organized into the random centroid, random partition, repeated heuristics, maxmin/distance optimization, greedy heuristics, sort heuristics, projection heuristics, density heuristics, and split heuristics categories. This parameter identifies the data processing capability on massive datasets for obtaining the initial centroids. The proposed work increased the convergence speed and speed-up and removed the worst case of local optima without affecting cluster quality and objective
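
As a rough illustration of the maxmin/distance optimization category listed above, the following minimal sketch seeds K-Means with the classic farthest-first (maxmin) heuristic and then runs plain Lloyd iterations. It is not the MDRH method proposed in the paper, whose details are not given in this section; the function names `maxmin_init` and `kmeans`, the random choice of the first centroid, and the toy data in the usage note are assumptions made only for this example.

```python
import math
import random


def maxmin_init(points, k, seed=0):
    """Pick k initial centroids with the farthest-first (maxmin) heuristic.

    Illustrative only: this is the generic maxmin idea, not the paper's MDRH.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Assumption: the first centroid is a uniformly random data point.
    centroids = [points[rng.randrange(len(points))]]
    # nearest[i] = distance from points[i] to its closest chosen centroid.
    nearest = [math.dist(p, centroids[0]) for p in points]
    while len(centroids) < k:
        # Next centroid: the point whose closest centroid is farthest away.
        idx = max(range(len(points)), key=nearest.__getitem__)
        centroids.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centroids


def kmeans(points, centroids, max_iter=100):
    """Plain Lloyd iterations; returns (final centroids, iterations used)."""
    centroids = [tuple(c) for c in centroids]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centroids:
            return centroids, iteration
        centroids = updated
    return centroids, max_iter
```

On a toy dataset of three well-separated groups, `maxmin_init(points, 3)` places one seed in each group regardless of which point is drawn first, so the subsequent `kmeans` call stabilizes after very few iterations. This is the kind of effect an initialization heuristic is meant to have on iteration count; the sketch makes no claim about the results reported for MDRH.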

Objective

Evaluation Criteria

Results and Discussion

CONCLUSION