A new cluster validity index for prototype based clustering algorithms based on inter- and intra-cluster density

Kadim Tasdemir, Erzsebet Merenyi

doi:10.1109/ijcnn.2007.4371300

Abstract

One of the fundamental challenges of clustering is how to evaluate, without auxiliary information, to what extent the obtained clusters fit the natural partitions of the data set. A common approach for evaluation of clustering results is to use validity indices. We propose a new validity index, Conn Index, for prototype based clustering. Conn Index is applicable to data sets with a wide variety of cluster characteristics (different shapes, sizes, densities, overlaps). We construct Conn Index based on inter- and intra-cluster connectivities of prototypes, which are found through a weighted Delaunay triangulation called connectivity matrix (1), where the weights indicate the data distribution. We compare the performance of Conn Index to commonly used indices on synthetic and real data sets.

I. INTRODUCTION

Clustering means splitting a data set into groups such that the data samples within a group are more similar to each other than to the data samples in other groups. Clustering is done with many methods which can be categorized in several ways where the two major ones are partitioning and hierarchical clustering. For any method, clustering the data directly becomes computationally heavy as the size of the data set increases. In order to significantly reduce the computational cost, two-step algorithms have been proposed (2), (3), (4), (5). Two-step algorithms (prototype based clustering) first find the quantization prototypes of data, and then cluster the prototypes. Using the prototypes instead of data can also reduce noise because the prototypes are the local averages of the data.

A widely and successfully used neural paradigm for finding prototypes is the Self-Organizing Map (SOM). The SOM is a spatially ordered quantization of a data space where the quantization prototypes are adaptively determined for optimal approximation of the (unknown) distribution of the data.
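The two-step scheme described above (first quantize the data into prototypes, then cluster the prototypes) can be illustrated with a minimal sketch. It is not the authors' implementation: plain k-means stands in for the SOM as the quantizer, single-linkage agglomerative clustering stands in for the prototype clustering step, and all function names, parameters, and the toy data are hypothetical choices made for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, prototypes):
    """Index of the prototype closest to `point`."""
    return min(range(len(prototypes)),
               key=lambda i: sq_dist(point, prototypes[i]))

def quantize(data, n_prototypes, n_iter=20, seed=0):
    """Step 1: find prototypes as local averages of the data.
    Assumption: plain k-means is used here in place of a SOM."""
    rng = random.Random(seed)
    prototypes = [list(p) for p in rng.sample(data, n_prototypes)]
    for _ in range(n_iter):
        groups = [[] for _ in prototypes]
        for p in data:
            groups[nearest(p, prototypes)].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old prototype if it attracted no data
                prototypes[i] = [sum(c) / len(g) for c in zip(*g)]
    return prototypes

def cluster_prototypes(prototypes, n_clusters):
    """Step 2: cluster the prototypes (single-linkage agglomeration,
    chosen only for brevity)."""
    clusters = [[i] for i in range(len(prototypes))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(sq_dist(prototypes[i], prototypes[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # merge the closest pair
    labels = [0] * len(prototypes)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

def two_step_cluster(data, n_prototypes, n_clusters, seed=0):
    """Each data sample inherits the cluster of its nearest prototype."""
    prototypes = quantize(data, n_prototypes, seed=seed)
    proto_labels = cluster_prototypes(prototypes, n_clusters)
    return [proto_labels[nearest(p, prototypes)] for p in data]

# Toy data (hypothetical): two well-separated Gaussian blobs.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(100)]
blob_b = [(rng.gauss(5, 0.3), rng.gauss(5, 0.3)) for _ in range(100)]
labels = two_step_cluster(blob_a + blob_b, n_prototypes=8, n_clusters=2)
```

The expensive pairwise comparison in step 2 runs over the 8 prototypes rather than the 200 data samples, which is where the computational saving of the two-step approach comes from.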
The SOM also facilitates visualization of the structure of a higher-dimensional data space in one or two dimensions, which can guide semi-manual clustering. Thus, the SOM is a powerful aid in capturing clusters in high-dimensional intricate data sets (1), (2), (3), (6).

With any clustering method, whether clustering the data itself or its prototypes, the main problems are to determine the number of clusters and to evaluate the validity of the clusters. A validity measure of the clustering ideally shows

Full Text