Impact Parameter Analysis of Subspace Clustering

Dongjin Lee, Junho Shim

doi:10.1155/2015/398452

Abstract

Subspace clustering, which detects all clusters in affine subspaces of a given high dimensional vector space, is used in various applications, including e-business. The performance and result of a subspace clustering algorithm highly depend on the parameter values the algorithm is tuned to execute. It may not be clear if the resultant clusters are indeed meaningful ones in a given dataset or if the result is just an artifact of the given parameter values. Although choosing the proper parameter values is crucial for both clustering quality and performance of the algorithm, there has been little research or discussion on this topic. In this paper, we propose a methodology for determining proper values of parameters in subspace clustering. Along with it, we validate our approach through experimental analysis, using various real-world datasets. The study can serve as a reference model for any subspace clustering experiment in which parameter setting is required to output clusters of quality.

Highlights

A group of algorithms called “subspace clustering” [1–4] are attracting academic interest for clustering high dimensional data
Clustering is a crucial task that is used in various applications, with the aim of detecting the dense regions of a given dataset, or as a prerequisite step for further processes, such as classification
Subspace clustering can be widely used in many smart business application areas, which may include, but are not limited to the following [5, 6]

Summary

Introduction

A group of algorithms called “subspace clustering” [1–4] is attracting academic interest for clustering high dimensional data. For these reasons, choosing adequate values for the (ε, τ) parameter pair is crucial. If the values are inappropriate, applying a subspace clustering algorithm to a given input will result in poor output, excessive running time, or both. One possible method is a trial-and-error approach, which repeatedly runs the clustering task with different combinations of parameter values and selects the most satisfactory result. This approach has its limits: because clustering is inherently computation-intensive, each run typically takes a long time, so trying many combinations of parameters may not be practical. Experimental analysis shows that our approach is reasonable on various real-world datasets.
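The trial-and-error approach described above can be sketched as a plain grid search over (ε, τ) pairs. This is a minimal illustration and not the method proposed in the paper. The toy `dense_cells` routine, the reading of ε as a grid-cell width and τ as a minimum point count, and the `coverage` score are all assumptions made for the example.

```python
from collections import Counter
from itertools import product


def dense_cells(points, eps, tau):
    """Toy stand-in for one clustering run (assumption, not the paper's algorithm).

    Points are binned into grid cells of width eps; a cell is kept
    when it holds at least tau points.
    """
    counts = Counter(tuple(int(x // eps) for x in p) for p in points)
    return {cell: n for cell, n in counts.items() if n >= tau}


def coverage(points, cells):
    """Hypothetical quality score: fraction of points that fall in dense cells."""
    return sum(cells.values()) / len(points)


def trial_and_error(points, eps_values, tau_values):
    """Run the clustering once per (eps, tau) pair and keep the best-scoring pair.

    Returns the best ((eps, tau), score) entry and the number of runs,
    which grows as len(eps_values) * len(tau_values).
    """
    results = []
    for eps, tau in product(eps_values, tau_values):
        cells = dense_cells(points, eps, tau)
        results.append(((eps, tau), coverage(points, cells)))
    best = max(results, key=lambda r: r[1])
    return best, len(results)
```

Even in this toy form, the number of clustering runs is the product of the two candidate lists, which is why the approach becomes impractical when a single run is already slow. The score used here also trivially favors large ε and small τ, so a real study would need a more careful quality measure.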

Strategy

Experimental Setup

MB 15 MB 16 GB 80 GB SSD

Results

Conclusions