Abstract

Subspace clustering is one of the efficient techniques for determining clusters in different subsets of dimensions. Ideally, such techniques should find all non-redundant clusters in which a data point participates. Unfortunately, existing hard subspace clustering algorithms fail to satisfy this property, and as the dimensionality of the data increases, classical subspace algorithms become inefficient. This work presents a new density-based subspace clustering algorithm (S_FAD) to overcome the drawbacks of classical algorithms. S_FAD follows a bottom-up approach and finds subspace clusters of varied density using different parameters of the DBSCAN algorithm. It optimizes the parameters of the DBSCAN algorithm through a hybrid meta-heuristic algorithm and uses hashing concepts to discover all non-redundant subspace clusters. The efficacy of S_FAD is evaluated against various existing subspace clustering algorithms on artificial and real datasets in terms of F_Score and rand_index. Performance is assessed on three criteria: average ranking, SRR ranking, and scalability over varied dimensions. Statistical analysis is performed through the Wilcoxon signed-rank test. Results reveal that S_FAD performs considerably better on the majority of the datasets and scales well up to 6400 dimensions on a real dataset.
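The abstract mentions that S_FAD uses hashing to discover all non-redundant subspace clusters. The sketch below is our own illustration of that general idea, not the paper's implementation: each candidate subspace cluster (as might be produced by running DBSCAN with varied parameters in different subspaces) is reduced to a hashable signature of its point membership, and a hash set keeps only one representative per distinct membership. The function names and the redundancy criterion (identical point membership) are assumptions for illustration.

```python
# Hypothetical sketch of hashing-based redundancy elimination for
# subspace clusters. Not the paper's actual algorithm: the redundancy
# criterion here (identical point membership) is an assumed simplification.

def signature(cluster_points):
    """Canonical, hashable signature of a cluster's point membership."""
    return frozenset(cluster_points)

def collect_non_redundant(candidate_clusters):
    """Keep one representative per distinct point membership.

    candidate_clusters: iterable of (subspace_dims, point_ids) pairs,
    e.g. the output of DBSCAN runs with varied (eps, minPts) settings
    in different subspaces.
    """
    seen = set()
    result = []
    for dims, points in candidate_clusters:
        sig = signature(points)
        if sig not in seen:        # average O(1) hash lookup
            seen.add(sig)
            result.append((tuple(dims), sig))
    return result

candidates = [
    ((0, 1), [1, 2, 3]),
    ((0, 2), [1, 2, 3]),  # same members found in another subspace -> redundant
    ((1, 2), [4, 5]),
]
print(collect_non_redundant(candidates))  # keeps 2 of the 3 candidates
```

The hash-set lookup is what makes the de-duplication cheap: checking a new cluster against all previously discovered ones costs constant time on average instead of a pairwise comparison.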

Highlights

  • High-dimensional data means data with numerous features

  • The experimental evaluation compares the proposed S_FAD algorithm with various conventional subspace clustering algorithms on different datasets

  • Since 7 real datasets are considered for evaluation and a two-tailed test is employed, the critical value for the Wilcoxon signed-rank test is 2


Summary

Introduction

High-dimensional data is data with numerous features. Such data exist in various domains, including recommendation systems, microarray data, and social media. Traditional clustering algorithms like K-Means, DBSCAN, and OPTICS (Fahad et al., 2014) perform clustering in full-dimensional space: they attempt to find clusters using all attributes given for each data object. Applying these algorithms becomes computationally expensive when the number of attributes/dimensions is large. This problem is called the "curse of dimensionality" (Steinbach et al., 2004). One of its causes is that distance measures lose their discriminative power because data points are sparse in high-dimensional space. Clusters in such high-dimensional spaces remain hidden within a few relevant dimensions. Subspace clustering is one of the efficient ways of performing clustering on high-dimensional data.
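The claim that distance measures lose their discriminative power in high dimensions can be demonstrated with a short, self-contained experiment (our own illustration, not from the paper): for random points in a unit hypercube, the relative contrast between the farthest and nearest neighbour of a query point shrinks as dimensionality grows.

```python
# Minimal sketch of distance concentration, the effect behind the
# "curse of dimensionality": as dimensions increase, the gap between
# the nearest and farthest neighbour distances becomes negligible.
import math
import random

def relative_contrast(n_points=200, n_dims=2, seed=0):
    """Return (max_dist - min_dist) / min_dist from a random query point."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    query = [rng.random() for _ in range(n_dims)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

for d in (2, 10, 100, 1000):
    print(f"dims={d:5d}  contrast={relative_contrast(n_dims=d):.3f}")
```

As the contrast approaches zero, "nearest" and "farthest" become nearly indistinguishable, which is why full-space distance-based clustering degrades and why subspace methods restrict attention to a few relevant dimensions.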

