Abstract

In this work, we study the k-means cost function. Given a dataset $X \subseteq \mathbb{R}^d$ and an integer $k$, the goal of the Euclidean k-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}^d$ such that $\Phi(C, X) \equiv \sum_{x \in X} \min_{c \in C} \|x - c\|^2$ is minimized. Let $\Delta(X, k) \equiv \min_{C \subseteq \mathbb{R}^d,\, |C| = k} \Phi(C, X)$ denote the cost of the optimal k-means solution. For any dataset $X$, $\Delta(X, k)$ decreases as $k$ increases. In this work, we try to understand this behavior more precisely. For any dataset $X \subseteq \mathbb{R}^d$, integer $k \geq 1$, and precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X, k)$. We show upper and lower bounds on this quantity. Our techniques generalize to the metric k-median problem in metric spaces of bounded doubling dimension. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O(L(X, k, \varepsilon/c))$ using $D^2$-sampling, for some fixed constant $c$, such that $\Phi(S, X) \leq \varepsilon \cdot \Delta(X, k)$. Applications include new pseudo-approximation guarantees for k-means++ and bounds for movement-based coreset constructions.
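As a minimal illustration of the quantities above, the following Python sketch (using numpy) computes the cost $\Phi(C, X)$ and samples centers via $D^2$-sampling, i.e. the k-means++ seeding rule in which each new center is drawn with probability proportional to its squared distance to the nearest center chosen so far. The function names are illustrative; this is a sketch of the standard technique, not the paper's implementation.

```python
import numpy as np

def kmeans_cost(X, C):
    """Phi(C, X): sum over points x in X of the squared distance to the nearest center in C."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    return d2.min(axis=1).sum()

def d2_sampling(X, k, seed=None):
    """Pick k centers from X by D^2-sampling (k-means++ seeding)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center chosen uniformly at random
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance of each point to its nearest current center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()       # P(x) proportional to D^2(x)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

With well-separated clusters, the second sampled center lands in the far cluster with high probability, so the sampled set's cost $\Phi(S, X)$ drops sharply compared to a single center.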
