Abstract

The $k$-means method is the method of choice for clustering large-scale data sets, and it performs exceedingly well in practice despite its exponential worst-case running time. To narrow the gap between theory and practice, $k$-means has been studied in the semi-random input model of smoothed analysis, which often leads to more realistic conclusions than mere worst-case analysis. For the case that $n$ data points in $\mathbb{R}^d$ are perturbed by Gaussian noise with standard deviation $\sigma$, it has been shown that the expected running time is bounded by a polynomial in $n$ and $1/\sigma$. This result assumes that squared Euclidean distances are used as the distance measure. In many applications, however, data is to be clustered with respect to Bregman divergences rather than squared Euclidean distances. A prominent example is the Kullback-Leibler divergence (a.k.a. relative entropy), which is commonly used to cluster web pages. To broaden the knowledge about this important class of distance measures, we analyze the running time of the $k$-means method for Bregman divergences. We first give a smoothed analysis of $k$-means with (almost) arbitrary Bregman divergences, and we show bounds of $\mathrm{poly}(n^{\sqrt{k}}, 1/\sigma)$ and $k^{kd} \cdot \mathrm{poly}(n, 1/\sigma)$. The latter yields a polynomial bound if $k$ and $d$ are small compared to $n$. On the other hand, we show that the exponential lower bound carries over to a huge class of Bregman divergences.
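
For reference (the notation here is not taken from the paper itself), a Bregman divergence is the standard generalization of the squared Euclidean distance induced by a strictly convex, differentiable function $\Phi$ on a convex domain:
\[
  d_\Phi(x, y) \;=\; \Phi(x) - \Phi(y) - \langle \nabla\Phi(y),\, x - y \rangle .
\]
Choosing $\Phi(x) = \lVert x \rVert_2^2$ recovers the squared Euclidean distance $\lVert x - y \rVert_2^2$, while the negative entropy $\Phi(x) = \sum_i x_i \log x_i$ yields, for probability vectors $x$ and $y$, the Kullback-Leibler divergence $d_{\mathrm{KL}}(x, y) = \sum_i x_i \log(x_i / y_i)$ mentioned above.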
