Unsupervised ranking of clustering algorithms by INFOMAX.

Sandipan Sikdar, Animesh Mukherjee, Matteo Marsili, Qichun Zhang

doi:10.1371/journal.pone.0239331

Abstract

Clustering and community detection provide a concise way of extracting meaningful information from large datasets. An ever-growing plethora of data clustering and community detection algorithms has been proposed. In this paper, we address the question of ranking the performance of clustering algorithms for a given dataset. We show that, for hard clustering and community detection, Linsker’s Infomax principle can be used to rank clustering algorithms. In brief, the algorithm that yields the highest value of the entropy of the partition, for a given number of clusters, is the best one. We show, on a wide range of datasets of various sizes and topological structures, that the ranking provided by the entropy of the partition over a variety of partitioning algorithms is strongly correlated with the overlap with a ground truth partition. The code related to the project is available at https://github.com/Sandipan99/Ranking_cluster_algorithms.
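The ranking criterion stated above can be illustrated with a minimal sketch. It assumes that the entropy of the partition is the Shannon entropy of the relative cluster sizes; the function names `partition_entropy` and `rank_by_entropy`, the algorithm labels, and the toy partitions are hypothetical and are not taken from the authors' repository.

```python
import math
from collections import Counter


def partition_entropy(labels):
    """Shannon entropy (in nats) of the cluster-size distribution of a hard partition."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def rank_by_entropy(partitions):
    """Return algorithm names sorted by decreasing partition entropy."""
    return sorted(
        partitions,
        key=lambda name: partition_entropy(partitions[name]),
        reverse=True,
    )


# Hypothetical outputs of three algorithms on the same 8 points,
# each producing the same number of clusters (2), as the criterion requires.
partitions = {
    "algo_A": [0, 0, 0, 0, 1, 1, 1, 1],
    "algo_B": [0, 0, 0, 0, 0, 0, 1, 1],
    "algo_C": [0, 0, 0, 0, 0, 0, 0, 1],
}

ranking = rank_by_entropy(partitions)
for name in ranking:
    print(name, round(partition_entropy(partitions[name]), 4))
```

Under this assumption, the most balanced partition ranks first, because a uniform cluster-size distribution maximizes the entropy at a fixed number of clusters.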

Highlights

Cluster analysis is increasingly used across a wide range of applications, from biology and bioinformatics [1] to social networks [2], which has led to the development of a plethora of clustering algorithms.
The rest of the paper will be devoted to testing the accuracy of this prediction, by comparing it with the ranking provided by the distance to the ground truth, according to the measures discussed above.
We report in detail the methodology for the stock dataset, which covers the case of different granularity levels of ground truth, while for the other cases we mainly report the results obtained.

Summary

Introduction

Cluster analysis is increasingly used across a wide range of applications, from biology and bioinformatics [1] to social networks [2], which has led to the development of a plethora of clustering algorithms. The merging of the clusters is obtained according to some chosen similarity measure. We consider both city-block (l1) and Euclidean (l2) distance-based similarity measures. We consider the following three classical linkage criteria: (1) single linkage (SI) [13], (2) complete linkage (CO) [14] and (3) average linkage (AV) [15]. Note that ‘l1SI’ denotes single linkage with city-block as the distance metric, and so on. We use this combination of acronyms for the algorithms and distance metrics in all results presented in the subsequent sections.
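To make the acronym scheme concrete, the following is a naive sketch of agglomerative merging under the two distance metrics and three linkage rules named above. It is an illustration only, not the authors' implementation: the names `METRICS`, `LINKAGES` and `agglomerate`, and the toy points, are hypothetical, and the brute-force search is meant for small inputs.

```python
from itertools import combinations

# Distance metrics named in the text: city-block (l1) and Euclidean (l2).
METRICS = {
    "l1": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "l2": lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5,
}

# Linkage rules: how point-to-point distances become a cluster-to-cluster distance.
LINKAGES = {
    "SI": min,                            # single linkage: closest pair
    "CO": max,                            # complete linkage: farthest pair
    "AV": lambda ds: sum(ds) / len(ds),   # average linkage: mean over all pairs
}


def agglomerate(points, name, k):
    """Merge the two closest clusters repeatedly until k clusters remain.

    `name` combines metric and linkage acronyms, e.g. 'l1SI' or 'l2AV'.
    Returns a list of clusters, each a list of point indices.
    """
    dist, link = METRICS[name[:2]], LINKAGES[name[2:]]
    clusters = [[i] for i in range(len(points))]

    def gap(pair):
        a, b = pair
        return link([dist(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]])

    while len(clusters) > k:
        # combinations yields a < b, so popping b leaves index a valid.
        a, b = min(combinations(range(len(clusters)), 2), key=gap)
        clusters[a] += clusters.pop(b)
    return clusters


# Toy data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
for name in ("l1SI", "l1CO", "l1AV", "l2SI", "l2CO", "l2AV"):
    print(name, agglomerate(points, name, 3))
```

On well-separated toy data like this, all six variants agree; on real datasets they generally produce different partitions for the same number of clusters, which is what makes a ranking among them meaningful.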

Methods

Results

Discussion

Conclusion