Data ultrametricity and clusterability

Dan Simovici, Kaixun Hua

doi:10.1088/1742-6596/1334/1/012002

Abstract

The increasing need to cluster massive datasets and the high cost of running clustering algorithms pose difficult problems for users. In this context it is important to determine whether a data set is clusterable, that is, whether it can be partitioned efficiently into well-differentiated groups containing similar objects. We approach data clusterability from an ultrametric-based perspective. We propose a novel approach to determining the ultrametricity of a dataset via a special type of matrix product, which allows us to evaluate the clusterability of the dataset. Furthermore, we show that applying our technique to a dissimilarity space generates the subdominant ultrametric of the dissimilarity.

Highlights

Clustering is the prototypical unsupervised learning activity which consists in identifying cohesive and well-differentiated groups of records in data
Other notions define data as clusterable when the minimum between-cluster separation is greater than the maximum intra-cluster distance [13], or when each element is closer to all elements in its cluster than to all other data [7]
We propose a novel approach that relates data clusterability to the extent to which the dissimilarity defined on the data set relates to a special ultrametric defined on the set

Summary

Introduction

Clustering is the prototypical unsupervised learning activity, which consists in identifying cohesive and well-differentiated groups of records in data. For the ultrametric space mentioned in Example 2.1, the closed spheres of radius 6 produce the clustering. If a dissimilarity defined on a data set is close to an ultrametric, it is natural to assume that the data set is clusterable. Let $L(A)$ be the finite set of elements in $P$ that occur in the matrix $A \in P^{n \times n}$. For an ultrametric matrix we have $a_{ij} \le \min\{\max\{a_{ik}, a_{kj}\} \mid 1 \le k \le n\}$. If $m$ is the least number such that $A^m = A^{m+1}$, the mapping $\delta : S \times S \to P$ defined by $\delta(x_i, x_j) = (A^m)_{ij}$ is the subdominant ultrametric for the dissimilarity $d$.
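The fixed-point construction of the subdominant ultrametric can be illustrated with a minimal sketch. It assumes that the "special type of matrix product" is the min-max product, (AB)ij = min over k of max(aik, bkj), and that the dissimilarity matrix is symmetric with a zero diagonal. The function names `minmax_product` and `subdominant_ultrametric` and the three-point example are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def minmax_product(a, b):
    """Min-max matrix product: result[i, j] = min over k of max(a[i, k], b[k, j])."""
    # a[:, :, None] is indexed (i, k, _) and b[None, :, :] is indexed (_, k, j),
    # so the broadcast maximum is indexed (i, k, j); the min runs over k (axis 1).
    return np.maximum(a[:, :, None], b[None, :, :]).min(axis=1)


def subdominant_ultrametric(d):
    """Iterate A, A^2, A^3, ... under the min-max product until A^m == A^(m+1).

    Returns the fixed-point matrix A^m and the exponent m.
    Assumes d is a symmetric dissimilarity matrix with zeros on the diagonal.
    """
    a = np.asarray(d, dtype=float)
    power, m = a, 1
    while True:
        nxt = minmax_product(power, a)
        # min and max only select existing entries, so exact comparison is safe.
        if np.array_equal(nxt, power):
            return power, m
        power, m = nxt, m + 1


# Hypothetical three-point dissimilarity that violates the ultrametric inequality:
# d(0, 2) = 5 exceeds max(d(0, 1), d(1, 2)) = 2.
d = np.array([[0, 1, 5],
              [1, 0, 2],
              [5, 2, 0]])
delta, m = subdominant_ultrametric(d)
print(delta)  # entry (0, 2) drops from 5 to 2; the other entries are unchanged
print(m)      # 2
```

In this example the entry for the pair (0, 2) is lowered to the largest step on the cheapest path through point 1, which is the usual single-linkage reading of the subdominant ultrametric.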

A Measure of Clusterability

Experimental Evidence on Small Artificial Data Sets

Conclusions and Future Work