Abstract

The rapid growth of high-dimensional datasets in recent years has created a pressing need to extract the knowledge underlying them. Clustering is the process of automatically finding groups of similar data points in the space of the dimensions or attributes of a dataset. Finding clusters in high-dimensional datasets is an important and challenging data mining problem. Data group together differently under different subsets of dimensions, called subspaces. Quite often a dataset can be better understood by clustering it in its subspaces, a process called subspace clustering. However, the number of subspaces grows exponentially with the dimensionality of the data, which makes subspace clustering computationally very expensive. There is a growing demand for efficient and scalable subspace clustering solutions in many big data application domains such as biology, computer vision, astronomy and social networking. Apriori-based hierarchical clustering is a promising approach that finds all possible higher-dimensional subspace clusters from the lower-dimensional clusters using a bottom-up process. However, the performance of existing algorithms based on this approach deteriorates drastically as the number of dimensions increases. Most of these algorithms require multiple database scans and generate a large number of redundant subspace clusters, either implicitly or explicitly, during the clustering process. In this paper, we present SUBSCALE, a novel clustering algorithm that finds non-trivial subspace clusters at minimal cost and requires only k database scans for a k-dimensional dataset. Our algorithm scales very well with the dimensionality of the dataset and is highly parallelizable. We present the details of the SUBSCALE algorithm and its evaluation.
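
The exponential growth in the number of subspaces can be made concrete with a small sketch. A k-dimensional dataset has 2^k - 1 non-empty subsets of dimensions. The function names below are hypothetical and are not part of SUBSCALE; the code only counts and lists candidate subspaces, lowest dimensionality first, which is the order a bottom-up search would visit them in.

```python
from itertools import combinations


def count_subspaces(k):
    """Number of non-empty subsets of k dimensions: 2^k - 1."""
    return 2 ** k - 1


def enumerate_subspaces(k):
    """Yield every non-empty subspace as a tuple of dimension indices,
    starting with the 1-dimensional subspaces and moving upward."""
    for size in range(1, k + 1):
        yield from combinations(range(k), size)


if __name__ == "__main__":
    for k in (3, 10, 20):
        print(k, count_subspaces(k))
```

With only 20 dimensions there are already more than a million subspaces to consider, which is why exhaustive enumeration does not scale.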

Highlights

  • With recent advancements in information technology, voluminous data are being captured in almost every conceivable area, ranging from astronomy to biological sciences

  • We present a novel subspace clustering algorithm that aims to remove two inefficiencies of existing approaches, multiple database scans and the generation of redundant subspace clusters, and has a high degree of parallelism

  • DBSCAN [9] is a well-known full-dimensional clustering algorithm; according to it, a point is dense if it has τ or more points in its ε-neighborhood, and a cluster is defined as a set of such dense points
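
The density criterion in the last highlight can be illustrated with a minimal sketch. It assumes Euclidean distance, a neighborhood radius written as `eps`, and that a point is not counted as its own neighbor; the source does not state these details, and the function name `dense_points` is hypothetical. It is a brute-force check for illustration, not the DBSCAN or SUBSCALE implementation.

```python
from math import dist


def dense_points(points, eps, tau):
    """Return the indices of points that have at least `tau` other points
    within Euclidean distance `eps` (assumption: the point itself is not
    counted as a neighbor)."""
    dense = []
    for i, p in enumerate(points):
        neighbours = sum(
            1
            for j, q in enumerate(points)
            if j != i and dist(p, q) <= eps
        )
        if neighbours >= tau:
            dense.append(i)
    return dense


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    print(dense_points(sample, eps=0.2, tau=2))
```

In the sample, the three points near the origin are mutually within `eps` of each other, while the isolated point at (5, 5) has no neighbors.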

Summary

Introduction

With recent advancements in information technology, voluminous data are being captured in almost every conceivable area, ranging from astronomy to biological sciences. Traditional clustering algorithms were designed to generate clusters in the full-dimensional space by measuring the proximity between data points using all of the dimensions of a dataset [8, 9]. The curse of dimensionality implies that data lose their contrast in higher-dimensional space [10, 11], so these full-dimensional clustering algorithms are unable to detect meaningful clusters as the dimensionality of the data increases. One technique for dealing with high dimensionality is to reduce the number of dimensions by removing the irrelevant (or less relevant) ones; for example, Principal Component Analysis (PCA) transforms the original high-dimensional space into a low-dimensional space [38]. However, dimensionality reduction is not always possible [33].
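
The loss of contrast in high-dimensional space can be observed with a small experiment. The sketch below assumes uniformly random points in the unit hypercube and Euclidean distance, neither of which comes from the source, and the function name `relative_contrast` is hypothetical. It measures how much farther the farthest point is from a query point than the nearest one, relative to the nearest distance.

```python
import random
from math import dist


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to `n_points` random points in the unit hypercube."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(n_dims)]
    distances = [
        dist(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    for d in (2, 10, 100, 1000):
        print(d, round(relative_contrast(200, d), 3))
```

As the number of dimensions grows, the printed ratio shrinks: the nearest and farthest points become almost equally far away, which is what makes full-dimensional proximity measures uninformative.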
