Abstract

The increasing need to cluster massive datasets and the high cost of running clustering algorithms pose difficult problems for users. In this context it is important to determine whether a data set is clusterable, that is, whether it may be partitioned efficiently into well-differentiated groups containing similar objects. We approach data clusterability from an ultrametric-based perspective. A novel approach to determine the ultrametricity of a dataset is proposed via a special type of matrix product, which allows us to evaluate the clusterability of the dataset. Furthermore, we show that applying our technique to a dissimilarity space generates the subdominant ultrametric of the dissimilarity.

Highlights

  • Clustering is the prototypical unsupervised learning activity which consists in identifying cohesive and well-differentiated groups of records in data

  • Other notions define data as clusterable when the minimum between-cluster separation is greater than the maximum intra-cluster distance [13], or when each element is closer to all elements in its cluster than to all other data [7]

  • We propose a novel approach that relates data clusterability to the extent to which the dissimilarity defined on the data set relates to a special ultrametric defined on the set


Summary

Introduction

Clustering is the prototypical unsupervised learning activity which consists in identifying cohesive and well-differentiated groups of records in data. For the ultrametric space mentioned in Example 2.1, the closed spheres of radius 6 produce the clustering. If a dissimilarity defined on a data set is close to an ultrametric, it is natural to assume that the data set is clusterable. Let $L(A)$ be the finite set of elements in $P$ that occur in the matrix $A \in P^{n \times n}$. For an ultrametric matrix we have $a_{ij} \le \min\{\max\{a_{ik}, a_{kj}\} \mid 1 \le k \le n\}$. If $m$ is the least number such that $A^m = A^{m+1}$, the mapping $\delta : S \times S \to P$ defined by $\delta(x_i, x_j) = (A^m)_{ij}$ is the subdominant ultrametric for the dissimilarity $d$.
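The fixed-point construction above can be illustrated with a short sketch. It assumes that the "special type of matrix product" is the (min, max) product, $(A \otimes B)_{ij} = \min_k \max\{a_{ik}, b_{kj}\}$, which is the reading consistent with the ultrametric inequality quoted above. The function names and the four-point dissimilarity matrix are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def minmax_product(a, b):
    """(min, max) product: c[i][j] = min over k of max(a[i][k], b[k][j])."""
    n = len(a)
    return [[min(max(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def subdominant_ultrametric(d):
    """Compute A, A^2, ... under the (min, max) product until A^m == A^(m+1).

    Returns the stable power A^m and the exponent m.
    Assumes d is a square, symmetric matrix with a zero diagonal.
    """
    power = [row[:] for row in d]  # A^1
    m = 1
    while True:
        nxt = minmax_product(power, d)  # A^(m+1)
        if nxt == power:
            return power, m
        power, m = nxt, m + 1


# Hypothetical dissimilarity on four points (symmetric, zero diagonal).
d = [
    [0, 1, 4, 6],
    [1, 0, 2, 6],
    [4, 2, 0, 3],
    [6, 6, 3, 0],
]

u, m = subdominant_ultrametric(d)
for row in u:
    print(row)
print("stabilised at m =", m)
```

Each entry of the returned matrix is the smallest achievable "largest step" over all chains of points joining the two objects, so the result satisfies the ultrametric inequality and never exceeds the original dissimilarity. In this sketch the exponent m at which the powers stabilise gives a rough sense of how far the input is from being ultrametric: an input that is already ultrametric stabilises at m = 1.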

A Measure of Clusterability
Experimental Evidence on Small Artificial Data Sets
Conclusions and Future Work
