Worst-case and smoothed analysis of k-means clustering with Bregman divergences

Bodo Manthey ,Heiko Röglin

doi:10.20382/jocg.v4i1a5

Abstract

The $k$-means method is the method of choice for clustering large-scale data sets and it performs exceedingly well in practice despite its exponential worst-case running-time. To narrow the gap between theory and practice, $k$-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than mere worst-case analysis. For the case that $n$ data points in $\mathbb{R}^d$ are perturbed by Gaussian noise with standard deviation $\sigma$, it has been shown that the expected running-time is bounded by a polynomial in $n$ and $1/\sigma$. This result assumes that squared Euclidean distances are used as the distance measure. In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (a.k.a. relative entropy) that is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running-time of the $k$-means method for Bregman divergences. We first give a smoothed analysis of $k$-means with (almost) arbitrary Bregman divergences, and we show bounds of poly($n^{\sqrt{k}}$), $1/\sigma$) and $k^{kd}$.poly($n$,$1/\sigma$). The latter yields a polynomial bound if $k$ and $d$ are small compared to $n$. On the other hand, we show that the exponential lower bound carries over to a huge class of Bregman divergences.

Full Text