Abstract

Subspace clustering is one of the efficient techniques for determining clusters in different subsets of dimensions. Ideally, such techniques should find all non-redundant clusters in which a data point participates. Unfortunately, existing hard subspace clustering algorithms fail to satisfy this property, and as the dimensionality of the data increases, classical subspace algorithms become inefficient. This work presents a new density-based subspace clustering algorithm (S_FAD) to overcome the drawbacks of classical algorithms. S_FAD follows a bottom-up approach and finds subspace clusters of varied density using different parameters of the DBSCAN algorithm. It optimizes the parameters of the DBSCAN algorithm through a hybrid meta-heuristic algorithm and uses hashing concepts to discover all non-redundant subspace clusters. The efficacy of S_FAD is evaluated against various existing subspace clustering algorithms on artificial and real datasets in terms of F_Score and rand_index. Performance is assessed on three criteria: average ranking, SRR ranking, and scalability over varied dimensions. Statistical analysis is performed through the Wilcoxon signed-rank test. Results reveal that S_FAD performs considerably better on the majority of the datasets and scales well up to 6400 dimensions on a real dataset.
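The abstract mentions that S_FAD uses hashing to discover all non-redundant subspace clusters. The sketch below is our own illustration of that general idea, not the paper's implementation: each candidate subspace cluster (as might be produced by running DBSCAN with varied parameters in different subspaces) is reduced to a hashable signature of its point membership, and a hash set keeps only one representative per distinct membership. The function names and the redundancy criterion (identical point membership) are assumptions for illustration.

```python
# Hypothetical sketch of hashing-based redundancy elimination for
# subspace clusters. Not the paper's actual algorithm: the redundancy
# criterion here (identical point membership) is an assumed simplification.

def signature(cluster_points):
    """Canonical, hashable signature of a cluster's point membership."""
    return frozenset(cluster_points)

def collect_non_redundant(candidate_clusters):
    """Keep one representative per distinct point membership.

    candidate_clusters: iterable of (subspace_dims, point_ids) pairs,
    e.g. the output of DBSCAN runs with varied (eps, minPts) settings
    in different subspaces.
    """
    seen = set()
    result = []
    for dims, points in candidate_clusters:
        sig = signature(points)
        if sig not in seen:        # average O(1) hash lookup
            seen.add(sig)
            result.append((tuple(dims), sig))
    return result

candidates = [
    ((0, 1), [1, 2, 3]),
    ((0, 2), [1, 2, 3]),  # same members found in another subspace -> redundant
    ((1, 2), [4, 5]),
]
print(collect_non_redundant(candidates))  # keeps 2 of the 3 candidates
```

The hash-set lookup is what makes the de-duplication cheap: checking a new cluster against all previously discovered ones costs constant time on average instead of a pairwise comparison.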

Highlights

  • High-dimensional data means data with numerous features

  • The experimental evaluation compares the proposed S_FAD algorithm with various conventional subspace clustering algorithms on different datasets

  • Since 7 real datasets are considered for evaluation and a two-tailed test is employed, the critical value for the Wilcoxon signed-rank test is 2


Summary

Introduction

High-dimensional data is data with numerous features. Such data exist in various domains, including recommendation systems, microarray data, and social media. Traditional clustering algorithms like K-Means, DBSCAN, and OPTICS (Fahad et al., 2014) perform clustering in full-dimensional space: they attempt to find clusters using all attributes given for each data object. Applying these algorithms becomes computationally expensive when the number of attributes/dimensions is large. This problem is called the "curse of dimensionality" (Steinbach et al., 2004). One of its causes is that distance measures lose their discriminative power because data points are sparse in high-dimensional space. Clusters in such high-dimensional spaces remain hidden within a few relevant dimensions. Subspace clustering is one of the efficient ways of performing clustering on high-dimensional data.
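The claim that distance measures lose their discriminative power in high dimensions can be demonstrated with a short, self-contained experiment (our own illustration, not from the paper): for random points in a unit hypercube, the relative contrast between the farthest and nearest neighbour of a query point shrinks as dimensionality grows.

```python
# Minimal sketch of distance concentration, the effect behind the
# "curse of dimensionality": as dimensions increase, the gap between
# the nearest and farthest neighbour distances becomes negligible.
import math
import random

def relative_contrast(n_points=200, n_dims=2, seed=0):
    """Return (max_dist - min_dist) / min_dist from a random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(f"dims={d:5d}  contrast={relative_contrast(n_dims=d):.3f}")
```

As the contrast approaches zero, "nearest" and "farthest" become nearly indistinguishable, which is why full-space distance-based clustering degrades and why subspace methods restrict attention to a few relevant dimensions.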

