Efficiency of random swap clustering

Pasi Fränti

doi:10.1186/s40537-018-0122-y

Abstract

The random swap algorithm aims at solving clustering by a sequence of prototype swaps, fine-tuning the exact locations of the prototypes by k-means. This randomized search strategy is simple to implement and efficient. It reaches a good-quality clustering relatively fast and, if iterated longer, finds the correct clustering with high probability. In this paper, we analyze the expected number of iterations needed to find the correct clustering. Using this result, we derive the expected time complexity of the random swap algorithm. The main results are that the expected time complexity has (1) linear dependency on the number of data vectors, (2) quadratic dependency on the number of clusters, and (3) inverse dependency on the size of the neighborhood. Experiments also show that the algorithm is clearly more efficient than k-means and almost never gets stuck in an inferior local minimum.
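The swap-then-fine-tune loop described in the abstract can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names, the choice of replacing a random prototype with a random data vector, the two k-means iterations per trial, and the rule of keeping a trial only when it lowers the sum of squared errors are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two vectors.
    return sum((p - q) ** 2 for p, q in zip(a, b))

def partition(data, centroids):
    # Index of the nearest prototype for every data vector.
    return [min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
            for x in data]

def sse(data, centroids):
    # Sum of squared distances from each vector to its nearest prototype.
    return sum(min(sq_dist(x, c) for c in centroids) for x in data)

def kmeans(data, centroids, iterations=2):
    # A few k-means iterations used only to fine-tune prototype locations
    # (the iteration count is an assumption of this sketch).
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        labels = partition(data, centroids)
        for j in range(len(centroids)):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its old prototype
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def random_swap(data, k, swaps=200, seed=0):
    rng = random.Random(seed)
    best = kmeans(data, rng.sample(data, k))
    best_sse = sse(data, best)
    for _ in range(swaps):
        trial = [list(c) for c in best]
        # The swap: move one random prototype onto a random data vector.
        trial[rng.randrange(k)] = list(rng.choice(data))
        trial = kmeans(data, trial)
        trial_sse = sse(data, trial)
        if trial_sse < best_sse:  # keep the trial only if it improves
            best, best_sse = trial, trial_sse
    return best, best_sse

# Toy data: three well-separated pairs of points.
data = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
prototypes, error = random_swap(data, k=3)
print(sorted(prototypes), error)
```

Each trial costs one swap plus a short k-means run, so the total work is governed by how many swaps are needed before a useful one is drawn, which is the quantity the paper analyzes.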

Highlights

The aim of clustering is to group a set of N data vectors {xi} in D-dimensional space into k clusters by optimizing a given objective function f
The factor indicates how many iterations are spent for all swaps compared to the last swap (CI = 0)
It is well known that k-means often gets stuck in an inferior local minimum, and because of this, most practitioners repeat the algorithm

Summary

Introduction

The aim of clustering is to group a set of N data vectors {xi} in D-dimensional space into k clusters by optimizing a given objective function f. K-means performs the clustering by minimizing the distances of the vectors to their cluster prototypes. This objective function is called the sum of squared errors (SSE), and minimizing it corresponds to minimizing the within-cluster variances. K-means was originally defined for numerical data only, but it has since been applied to other types of data as well. The key is to define the distance or similarity between the data vectors and to define the prototype (center) of a cluster. How to do this is not trivial, but once both are properly defined, k-means can be applied. For categorical data, several alternatives have been compared, including k-medoids, k-modes, and k-entropies [1].
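For reference, the SSE objective named above can be written out explicitly. The symbols c_j (the prototype of cluster j) and p_i (the index of the cluster to which x_i is assigned) are notation assumed here for illustration; they do not appear in the text above.

```latex
\mathrm{SSE} = \sum_{i=1}^{N} \lVert x_i - c_{p_i} \rVert^{2},
\qquad
p_i = \arg\min_{1 \le j \le k} \lVert x_i - c_j \rVert^{2}
```

With Euclidean distance, the prototype that minimizes the SSE of a fixed cluster is the mean of its vectors, which is why extending k-means to other data types requires defining both a distance and a prototype.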

Methods

Results

Conclusion