Abstract
Centroid-based clustering algorithms depend on the number of clusters, the initial centroids, the distance measure, and the statistical measure of central tendency. The centroid initialization algorithm determines convergence speed, computing efficiency, execution time, scalability, memory utilization, and performance for big data clustering. Researchers have proposed various cluster initialization techniques: some reduce the number of iterations at the cost of lower cluster quality, while others improve cluster quality at the cost of more iterations. For these reasons, this study proposes the Maxmin Data Range Heuristic (MDRH), an initial centroid method for K-Means (KM) clustering that reduces execution time and the number of iterations and improves cluster quality for big data clustering. The proposed MDRH method is compared against the classical KM and KM++ algorithms on four real datasets. The MDRH method achieves better effectiveness and efficiency on the RS, DB, CH, SC, IS, and CT quantitative measures.
Highlights
The rapid development of digital technologies has produced enormous amounts of data in different formats at high speed, such as data from social media
This paper summarizes the value, veracity, variability, and visualization characteristics of big data as “Veracity validates accuracy on the basis of variety, value identifies the predicted value based on volume and variety, variability presents specific analysis tools based on volume and variety, and visualization visualizes the results and problems based on volume, variety, and velocity.”
Efficiency and effectiveness results are shown in Tables 3-4; the reported result for each evaluation measure is the average over ten trials
Summary
The rapid development of digital technologies has produced enormous amounts of data in different formats at high speed, such as data from social media. Table 1 presents the pros and cons of initial centroid methods for big data clustering, drawn from the discussed literature and comparative analyses (Celebi et al., 2013; Fränti & Sieranoja, 2019; He et al., 2004; Peña et al., 1999; Steinley & Brusco, 2007), across the following categories: random centroid, random partition, repeated heuristics, maxmin/distance optimization, greedy heuristics, sort heuristics, projection heuristics, density heuristics, and split heuristics. This parameter identifies the data processing capability on massive datasets for obtaining the initial centroids. The proposed work increases the convergence speed and speed-up and removes the worst case of local optima without affecting cluster quality and objective
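To illustrate the maxmin/distance optimization category named above, the following is a minimal sketch of a generic farthest-first (maxmin) seeding. It is not the MDRH method, whose steps are not described in this summary; the function name `maxmin_init`, its parameters, and the toy data are assumptions made for illustration only.

```python
import numpy as np


def maxmin_init(X, k, seed=0):
    """Generic maxmin (farthest-first) seeding; hypothetical helper, not MDRH."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Pick the first centroid uniformly at random from the data points.
    first = int(rng.integers(X.shape[0]))
    centroids = [X[first]]

    # Distance from every point to its nearest centroid chosen so far.
    min_dist = np.linalg.norm(X - X[first], axis=1)

    for _ in range(1, k):
        # The next centroid is the point farthest from all chosen centroids.
        idx = int(np.argmax(min_dist))
        centroids.append(X[idx])
        # Refresh nearest-centroid distances with the newly added centroid.
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[idx], axis=1))

    return np.array(centroids)


# Toy data (assumed): three well-separated groups of two points each.
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]], dtype=float)
centroids = maxmin_init(X, k=3)
```

The resulting array can be passed to a K-Means implementation as its starting centroids. A seeding of this kind spreads the initial centroids across the data range, but it is sensitive to outliers, which is one of the trade-offs that comparative studies of initialization methods typically weigh.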
More From: International Journal of Information Retrieval Research