Abstract

Subspace clustering, which detects all clusters in affine subspaces of a given high-dimensional vector space, is used in various applications, including e-business. Both the performance and the result of a subspace clustering algorithm depend heavily on the parameter values with which the algorithm is run. It may not be clear whether the resulting clusters are genuinely meaningful for a given dataset or merely an artifact of the chosen parameter values. Although choosing proper parameter values is crucial for both the clustering quality and the performance of the algorithm, there has been little research or discussion on this topic. In this paper, we propose a methodology for determining proper parameter values in subspace clustering. We also validate our approach through experimental analysis on various real-world datasets. The study can serve as a reference model for any subspace clustering experiment in which parameters must be set to produce high-quality clusters.

Highlights

  • A group of algorithms called “subspace clustering” [1–4] is attracting academic interest for clustering high-dimensional data

  • Clustering is a crucial task that is used in various applications, with the aim of detecting the dense regions of a given dataset, or as a prerequisite step for further processes, such as classification

  • Subspace clustering can be widely used in many smart business application areas, which may include, but are not limited to the following [5, 6]


Summary

Introduction

A group of algorithms called “subspace clustering” [1–4] is attracting academic interest for clustering high-dimensional data. For these reasons, choosing adequate values for the (ε, τ) parameter pair is crucial. If the values are inappropriate, applying a subspace clustering algorithm to a given input will result in poor output, excessive running time, or both. One possible method is a trial-and-error approach, which repeatedly runs the clustering task with different combinations of parameter values and selects the most satisfactory result. This approach has a limitation: because clustering is inherently computation-intensive, its running time is typically long, so trying many combinations of parameters may not be practical. Experimental analysis shows that our approach is reasonable on various real-world datasets.
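The trial-and-error approach described above can be illustrated with a minimal sketch. It is not the paper's method or any particular subspace clustering algorithm: `toy_cluster` and `toy_score` are hypothetical stand-ins, and the reading of ε as a neighbourhood radius and τ as a minimum neighbour count is an assumption made only for this illustration.

```python
from itertools import product


def toy_cluster(points, eps, tau):
    """Hypothetical stand-in for one clustering run, on 1-D data only.

    Assumption: a point is "dense" if at least `tau` points (itself
    included) lie within `eps` of it; consecutive dense points that are
    no more than `eps` apart are grouped into one cluster.
    """
    pts = sorted(points)
    dense = [p for p in pts if sum(abs(p - q) <= eps for q in pts) >= tau]
    clusters = []
    for p in dense:
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def toy_score(clusters, n_points):
    """Hypothetical quality score: coverage minus a per-cluster penalty."""
    covered = sum(len(c) for c in clusters)
    return covered / n_points - 0.1 * len(clusters)


def trial_and_error(points, eps_values, tau_values):
    """Run one clustering per (eps, tau) pair and keep the best-scoring pair."""
    best = None
    runs = 0
    for eps, tau in product(eps_values, tau_values):
        runs += 1  # each pair costs one full clustering run
        score = toy_score(toy_cluster(points, eps, tau), len(points))
        if best is None or score > best[0]:
            best = (score, eps, tau)
    return best, runs


points = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.0]
eps_values = [0.05, 0.15, 1.0]
tau_values = [2, 3, 4]
best, runs = trial_and_error(points, eps_values, tau_values)
print(f"runs={runs}, best (score, eps, tau)={best}")
```

The number of clustering runs equals the product of the candidate counts for ε and τ, so even a modest grid multiplies the cost of a single run, which is the practical limitation noted above.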

Strategy
Experimental Setup
(Table fragment: … MB, 15 MB, 16 GB, 80 GB SSD)
Results
Conclusions
