A novel algorithm for fast and scalable subspace clustering of high-dimensional data

Amardeep Kaur, Amitava Datta

doi:10.1186/s40537-015-0027-y

Open Access

Journal: Journal of Big Data
Publication Date: Aug 12, 2015
Citations: 66
License type: CC BY 4.0

Affiliation: University of Western Australia

Abstract

Rapid growth of high-dimensional datasets in recent years has created a pressing need to extract the knowledge underlying them. Clustering is the process of automatically finding groups of similar data points in the space of the dimensions, or attributes, of a dataset. Finding clusters in high-dimensional datasets is an important and challenging data mining problem. Data points group together differently under different subsets of dimensions, called subspaces. Quite often a dataset can be better understood by clustering it in its subspaces, a process called subspace clustering. However, the number of subspaces grows exponentially with the dimensionality of the data, which makes subspace clustering computationally very expensive. There is a growing demand for efficient and scalable subspace clustering solutions in many big data application domains such as biology, computer vision, astronomy and social networking. Apriori-based hierarchical clustering is a promising approach that finds all possible higher-dimensional subspace clusters from the lower-dimensional clusters using a bottom-up process. However, the performance of existing algorithms based on this approach deteriorates drastically as the number of dimensions increases. Most of these algorithms require multiple database scans and generate a large number of redundant subspace clusters, either implicitly or explicitly, during the clustering process. In this paper, we present SUBSCALE, a novel clustering algorithm that finds non-trivial subspace clusters with minimal cost and requires only k database scans for a k-dimensional dataset. Our algorithm scales very well with the dimensionality of the dataset and is highly parallelizable. We present the details of the SUBSCALE algorithm and its evaluation.
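
To make the exponential growth mentioned in the abstract concrete, the count below is a short worked calculation. It assumes that a subspace is any non-empty subset of the k dimensions, which matches the abstract's description of subspaces as subsets of dimensions; the sample values of k are chosen only for illustration.

```latex
\[
  \sum_{d=1}^{k} \binom{k}{d} = 2^{k} - 1,
  \qquad \text{e.g. } k = 10 \Rightarrow 1023,
  \quad k = 50 \Rightarrow 2^{50} - 1 \approx 1.13 \times 10^{15}.
\]
```

Under this assumption, each added dimension roughly doubles the number of candidate subspaces, which is why exhaustively clustering every subspace quickly becomes impractical.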

Highlights

With recent advancements in information technology, voluminous data are being captured in almost every conceivable area, ranging from astronomy to biological sciences.
We present a novel subspace clustering algorithm that aims to remove both inefficiencies of existing approaches (multiple database scans and redundant subspace clusters) and has a high degree of parallelism.
DBSCAN [9] is a well-known full-dimensional clustering algorithm. According to it, a point is dense if it has τ or more points in its ε-neighborhood, and a cluster is defined as a set of such dense points.
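
The density criterion in the last highlight can be sketched in a few lines of Python. This is a minimal illustration, not the DBSCAN or SUBSCALE implementation: the function names, the Euclidean metric, the sample points, and the choice not to count a point as its own neighbor are all assumptions made for the example.

```python
import math

def epsilon_neighborhood(points, index, eps):
    """Return indices of all OTHER points within distance eps of points[index].

    Assumption: Euclidean distance, and a point is not its own neighbor.
    """
    center = points[index]
    return [
        j for j, p in enumerate(points)
        if j != index and math.dist(center, p) <= eps
    ]

def dense_points(points, eps, tau):
    """Return indices of points that have tau or more neighbors within eps."""
    return [
        i for i in range(len(points))
        if len(epsilon_neighborhood(points, i, eps)) >= tau
    ]

# Hypothetical data: four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense = dense_points(points, eps=0.2, tau=3)
print(dense)
```

With these sample values, each of the four nearby points has three neighbors within ε and so qualifies as dense, while the isolated point does not. The brute-force neighbor search is quadratic in the number of points and is used here only for clarity.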

Summary

Introduction

With recent advancements in information technology, voluminous data are being captured in almost every conceivable area, ranging from astronomy to biological sciences. Traditional clustering algorithms were designed to generate clusters in the full-dimensional space by measuring the proximity between the data points using all of the dimensions of a dataset [8, 9]. The curse of dimensionality implies that the data loses its contrast in the higher-dimensional space [10, 11]. As a result, these full-dimensional clustering algorithms are unable to detect meaningful clusters as the dimensionality of the data increases. One technique for dealing with high dimensionality is to reduce the number of dimensions by removing the irrelevant (or less relevant) dimensions; for example, Principal Component Analysis (PCA) transforms the original high-dimensional space into a low-dimensional space [38]. However, dimensionality reduction is not always possible [33].
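
The loss of contrast described above can be observed with a small experiment. The sketch below is an illustration under stated assumptions, not a result from the paper: it draws hypothetical uniformly random points in the unit hypercube, measures Euclidean distances from one random query point, and reports the relative contrast (farthest minus nearest distance, divided by the nearest distance). The function name, sample size, seed, and dimensions are arbitrary choices.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Relative contrast (d_max - d_min) / d_min for uniform random data.

    Assumption: points are i.i.d. uniform in [0, 1]^dim and distances
    are Euclidean; this is synthetic data, not a real dataset.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest

# Compare a low-, medium- and high-dimensional space.
contrasts = {dim: relative_contrast(dim) for dim in (2, 20, 200)}
for dim, value in contrasts.items():
    print(f"dim={dim:>3}  relative contrast={value:.3f}")
```

As the dimensionality grows, the nearest and farthest points end up at nearly the same distance from the query, so the relative contrast shrinks. This is the effect that makes proximity measured over all dimensions a weak signal for clustering.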
