Abstract

Data often exists in subspaces embedded within a high-dimensional space. Subspace clustering seeks to group data according to the dimensions relevant to each subspace. This requires the estimation of subspaces as well as the clustering of data. Subspace clustering becomes increasingly challenging in high dimensional spaces due to the curse of dimensionality which affects reliable estimations of distances and density. Recently, another aspect of high-dimensional spaces has been observed, known as the hubness phenomenon, whereby few data points appear frequently as nearest neighbors of the rest of the data. The distribution of neighbor occurrences becomes skewed with increasing intrinsic dimensionality of the data, and few points with high neighbor occurrences emerge as hubs. Hubs exhibit useful geometric properties and have been leveraged for clustering data in the full-dimensional space. In this paper, we study hubs in the context of subspace clustering. We present new characterizations of hubs in relation to subspaces, and design graph-based meta-features to identify a subset of hubs which are well fit to serve as seeds for the discovery of local latent subspaces and clusters. We propose and evaluate a hubness-driven algorithm to find subspace clusters, and show that our approach is superior to the baselines, and is competitive against state-of-the-art subspace clustering methods. We also identify the data characteristics that make hubs suitable for subspace clustering. Such characterization gives valuable guidelines to data mining practitioners.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call