A Comparative Performance Study of Hybrid Firefly Algorithms for Automatic Data Clustering

Absalom El-Shamir Ezugwu,Nahla Aljojo,Moyinoluwa B Agbaje,Rosanne Els,Mohamed Abd Elaziz,Haruna Chiroma

doi:10.1109/access.2020.3006173

Abstract

In cluster analysis, a long-standing goal has been to devise the best possible means of automatically determining the number of clusters. However, because of a lack of prior domain knowledge and the uncertainty associated with data object characteristics, it is challenging to choose an appropriate number of clusters, especially when dealing with data objects of high dimension and varying size and density. In the last few decades, researchers have proposed and developed several nature-inspired metaheuristic algorithms to solve data clustering problems. Many studies have shown that the firefly algorithm is a robust, efficient and effective nature-inspired swarm intelligence global search technique, which has been successfully applied to diverse NP-hard optimization problems. However, the diversification search process employed by the firefly algorithm can reduce its speed and convergence rate on large-scale optimization problems. This study therefore investigates the application of four hybrid firefly algorithms to the task of automatic clustering of high-density, large-scale unlabelled datasets. In contrast to most existing classical heuristic-based data clustering techniques, the proposed hybrid algorithms do not require any prior knowledge of the data objects to be classified. Instead, the hybrid methods determine the optimal number of clusters automatically and empirically during program execution. Two well-known clustering validity indices, namely the Compact-Separated and Davies-Bouldin indices, are employed to evaluate the implemented hybrid firefly algorithms. Furthermore, twelve standard ground-truth clustering datasets from the UCI Machine Learning Repository are used to evaluate the robustness and effectiveness of the algorithms against those of the classical swarm optimization algorithms and other related clustering results from the literature. The experimental results show that the new clustering methods outperform existing standalone and other hybrid metaheuristic techniques in terms of clustering validity measures.
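
As an illustration of one of the two validity indices named in the abstract, the following is a minimal sketch of the Davies-Bouldin index in its commonly used form (mean distance to centroid as within-cluster scatter, Euclidean distance between centroids). It is not taken from the paper: the function names `centroid` and `davies_bouldin` and the list-of-lists data layout are assumptions made for this example, and the paper's own formulation may differ in detail.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Lower values indicate more compact, better separated clusters.
    Assumes every cluster is non-empty and no two centroids coincide.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(c) for c in clusters]
    # Within-cluster scatter: mean distance of points to their own centroid.
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = []
    for i in range(k):
        # Similarity of cluster i to every other cluster; keep the worst case.
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / k


# Two partitions of the same four points (hypothetical toy data).
good = [[[0, 0], [0, 2]], [[10, 0], [10, 2]]]
poor = [[[0, 0], [10, 0]], [[0, 2], [10, 2]]]
print(davies_bouldin(good), davies_bouldin(poor))
```

Because the index is minimized, a metaheuristic such as a firefly hybrid can use it directly as a fitness function over candidate partitions.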

Highlights

Data clustering is an important unsupervised classification technique that groups data so that similar items fall into the same cluster according to some similarity metric [1]–[4].
The hybridization methods described in this paper focus on exploiting the advantages of both the firefly algorithm (FA) and other representative algorithms, namely particle swarm optimization (PSO), artificial bee colony optimization (ABC), invasive weed optimization (IWO), and teaching-learning-based optimization (TLBO).
In this performance study, we propose a series of hybrid firefly algorithms to solve automatic data clustering problems.
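
For readers unfamiliar with the base method, the movement step of the standard firefly algorithm can be sketched as below. This is the textbook update rule, not the hybrid variants studied in the paper; the function name `move_firefly` and the default parameter values (`beta0`, `gamma`, `alpha`) are assumptions chosen for illustration.

```python
import math
import random


def move_firefly(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.2, rng=random):
    """Move firefly x_i toward a brighter firefly x_j (standard FA step).

    beta0: attractiveness at zero distance
    gamma: light absorption coefficient
    alpha: scale of the uniform random perturbation
    """
    # Squared Euclidean distance between the two fireflies.
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    # Attractiveness decays exponentially with squared distance.
    beta = beta0 * math.exp(-gamma * r2)
    return [
        a + beta * (b - a) + alpha * (rng.random() - 0.5)
        for a, b in zip(x_i, x_j)
    ]


# With no randomness and no absorption, the firefly jumps onto its target.
print(move_firefly([0.0, 0.0], [1.0, 2.0], beta0=1.0, gamma=0.0, alpha=0.0))
```

In a clustering setting, each position vector would encode a candidate solution (for example, a set of cluster centroids), and brightness would be derived from a validity index.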

Summary

Introduction

Data clustering is an important unsupervised classification technique that groups data so that similar items fall into the same cluster according to some similarity metric [1]–[4]. Clustering is used in a variety of real-world applications, such as marketing, biology, image analysis, libraries, insurance, data mining, medicine, statistical data analysis, community. Cluster analysis was first used in two social science domains, namely anthropology and psychology [8]; it was used for trait theory classification in personality psychology by Cattell as early as 1943 [8], [9]. The method has since spread, with significant relevance to newer research areas such as data science and machine learning. Notably, clustering data into meaningful groups is an important task in both artificial intelligence and data mining.
