Abstract

Clustering large, high-dimensional spatial data sets that contain noise is a difficult challenge in data mining. In this paper, we present a new subspace clustering method, called SCI (Subspace Clustering based on Information), to solve this problem. SCI combines Shannon information with grid-based and density-based clustering techniques. Designing a clustering algorithm is equivalent to constructing an equivalence relation among data points. We therefore propose an equivalence relation, called density-connected, to identify the main bodies of clusters. For noise detection and cluster boundary discovery, we also use the grid approach to devise a new cohesive mechanism that merges border data points into clusters and filters out noise. However, the curse of dimensionality is a well-known and serious problem when applying the grid approach to high-dimensional data sets, because the number of grid cells grows exponentially with the number of dimensions. To strike a compromise between randomness and structure, we propose an automatic method for attribute selection based on Shannon information. Because it requires only one scan of the data, SCI is very efficient, with a run time that is linear in the size of the input data set. As our experimental results show, SCI is effective at discovering clusters of arbitrary shape.
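The abstract does not spell out the attribute selection criterion, so the following is only a minimal sketch of the general idea, under stated assumptions: each attribute is cut into equal-width grid intervals, the Shannon entropy of the resulting histogram is computed, and attributes with lower entropy (more structure, less randomness) are ranked first. The function names `attribute_entropy` and `rank_attributes`, the `bins` parameter, and the synthetic data are all hypothetical and are not taken from SCI itself.

```python
import math
import random
from collections import Counter


def attribute_entropy(values, bins=10):
    """Shannon entropy (in bits) of one attribute after equal-width binning.

    Assumption: a plain 1-D histogram stands in for the grid; the paper's
    actual grid construction may differ.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant attribute carries no uncertainty
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_attributes(points, bins=10):
    """Return attribute indices sorted by ascending entropy, plus the entropies."""
    dims = len(points[0])
    entropies = [
        attribute_entropy([p[d] for p in points], bins) for d in range(dims)
    ]
    order = sorted(range(dims), key=lambda d: entropies[d])
    return order, entropies


# Synthetic illustration: attribute 0 has two tight groups, attribute 1 is
# uniform noise, so attribute 0 should look more "structured".
rng = random.Random(0)
points = [
    (rng.gauss(0.2 if i % 2 == 0 else 0.8, 0.02), rng.random())
    for i in range(1000)
]
order, entropies = rank_attributes(points)
print("attributes ranked by entropy:", order)
```

A single pass over the data is enough to fill the per-attribute histograms, which is consistent with the one-scan property claimed above, although this sketch stores the points in memory for simplicity.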

