Abstract
Clustering non-Euclidean data is difficult, and one of the most widely used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean – as used in k-means – is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second (“SWAP”) phase of the algorithm, while still finding the same results as the original PAM algorithm. If we relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by eagerly performing additional swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies, because (i) the classic (“BUILD”) initialization now becomes the bottleneck, and (ii) our swap is fast enough to compensate for worse starting conditions. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100,200, we observed speedups of 458× and 1191×, respectively, compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.
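The medoid mentioned above is the cluster member minimizing the total dissimilarity to all other members. A minimal illustrative sketch (not the paper's optimized code; the function name and the 1-d example data are our own):

```python
def medoid(points, dissim):
    """Return the index of the medoid: the member of `points` with the
    smallest total dissimilarity to all members, under an arbitrary
    dissimilarity function `dissim` (no Euclidean structure needed)."""
    best, best_cost = None, float("inf")
    for i, p in enumerate(points):
        cost = sum(dissim(p, q) for q in points)
        if cost < best_cost:
            best, best_cost = i, cost
    return best

# Example with the absolute difference as dissimilarity on 1-d data;
# the medoid must be one of the input objects, unlike the mean.
pts = [1, 2, 3, 4, 10]
m = medoid(pts, lambda a, b: abs(a - b))  # index of the central object
```

Note that a naive SWAP phase re-evaluates the clustering cost for every (medoid, non-medoid) pair, which is what makes the original PAM expensive and motivates the O(k)-fold speedup discussed in the abstract.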
Highlights
Clustering is a common unsupervised machine learning task, in which the data set has to be automatically partitioned into “clusters”, such that objects within the same cluster are more similar, while objects in different clusters are more different
The algorithm CLARANS (Clustering Large Applications based on RANdomized Search, Ng and Han 1994, 2002) interprets the search space as a high-dimensional hypergraph, where each edge corresponds to swapping a medoid with a non-medoid
We focus on improving the original Partitioning Around Medoids (PAM) algorithm here, which is a commonly used subroutine even in the faster variants such as CLARA
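To make the CLARANS search strategy from the highlights concrete, here is a hypothetical sketch (our own function names and parameters, not the authors' code): instead of evaluating all swaps like PAM, CLARANS samples random (medoid, non-medoid) swaps and moves to the first neighbor that improves the total deviation.

```python
import random

def total_deviation(medoids, points, dissim):
    # TD: sum of dissimilarities of each point to its nearest medoid.
    return sum(min(dissim(p, points[m]) for m in medoids) for p in points)

def clarans_step(medoids, points, dissim, maxneighbor, rng):
    """Try up to `maxneighbor` random (medoid, non-medoid) swaps and
    accept the first swap that lowers the total deviation; if none
    improves, return the current medoids (a sampled local optimum)."""
    cur = total_deviation(medoids, points, dissim)
    nonmedoids = [i for i in range(len(points)) if i not in medoids]
    for _ in range(maxneighbor):
        m = rng.choice(medoids)       # medoid to remove
        x = rng.choice(nonmedoids)    # candidate replacement
        cand = [x if i == m else i for i in medoids]
        if total_deviation(cand, points, dissim) < cur:
            return cand               # move to the better neighbor
    return medoids
```

This random-neighbor search avoids PAM's exhaustive swap evaluation, at the price of only probabilistic exploration of the swap graph.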
Summary
Clustering is a common unsupervised machine learning task, in which the data set has to be automatically partitioned into “clusters”, such that objects within the same cluster are more similar, while objects in different clusters are more different. A classic method taught in textbooks is k-means (for an overview of the complicated history of k-means, refer to Bock, 2007), where the data is modeled using k cluster means, which are iteratively refined by assigning all objects to the nearest mean and then recomputing the mean of each cluster. This converges to a local optimum because the mean is the least squares estimator of location, and both steps reduce the same quantity, a measure known as the sum of squared errors (SSQ): SSQ := ∑_{j=1}^{k} ∑_{x_i ∈ C_j} ||x_i − μ_j||². We include a brief recap on the history of the PAM algorithm, updated and more extensive benchmarks on additional data sets, and cover additional related work.
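The two refinement steps of k-means and the SSQ they both reduce can be sketched as follows (a minimal 1-d illustration with our own function names, not a production implementation):

```python
def ssq(points, means, assign):
    # SSQ: sum over all points of the squared distance to the mean
    # of the cluster they are assigned to.
    return sum((p - means[assign[i]]) ** 2 for i, p in enumerate(points))

def kmeans_step(points, means):
    """One refinement iteration: (1) assign each point to its nearest
    mean, (2) recompute each mean as the average of its assigned points.
    Both steps can only decrease (or keep) the SSQ."""
    assign = [min(range(len(means)), key=lambda j: (p - means[j]) ** 2)
              for p in points]
    new_means = []
    for j in range(len(means)):
        members = [p for i, p in enumerate(points) if assign[i] == j]
        new_means.append(sum(members) / len(members) if members else means[j])
    return new_means, assign

means, assign = kmeans_step([0, 1, 2, 10, 11, 12], [0.0, 10.0])
# means are refined toward the two cluster centers 1.0 and 11.0
```

Because the mean minimizes the squared error within each cluster, step (2) never increases SSQ, and step (1) never increases it either, which yields the convergence to a local optimum mentioned above. The medoid plays the analogous role for arbitrary dissimilarities.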