Abstract

In this work, we study the k-means cost function. Given a dataset $X \subseteq \mathbb{R}^d$ and an integer $k$, the goal of the Euclidean k-means problem is to find a set of $k$ centers $C \subseteq \mathbb{R}^d$ such that $\Phi(C, X) \equiv \sum_{x \in X} \min_{c \in C} \|x - c\|^2$ is minimized. Let $\Delta(X, k) \equiv \min_{C \subseteq \mathbb{R}^d,\, |C| = k} \Phi(C, X)$ denote the cost of the optimal k-means solution. For any dataset $X$, $\Delta(X, k)$ decreases as $k$ increases. In this work, we try to understand this behavior more precisely. For any dataset $X \subseteq \mathbb{R}^d$, integer $k \geq 1$, and precision parameter $\varepsilon > 0$, let $L(X, k, \varepsilon)$ denote the smallest integer such that $\Delta(X, L(X, k, \varepsilon)) \leq \varepsilon \cdot \Delta(X, k)$. We show upper and lower bounds on this quantity. Our techniques generalize to the metric k-median problem in metric spaces of bounded doubling dimension. Finally, we observe that for any dataset $X$, we can compute a set $S$ of size $O(L(X, k, \varepsilon/c))$ using $D^2$-sampling, for some fixed constant $c$, such that $\Phi(S, X) \leq \varepsilon \cdot \Delta(X, k)$. Applications include new pseudo-approximation guarantees for k-means++ and bounds for movement-based coreset constructions.
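As a minimal illustration of the quantities above, the following Python sketch (using numpy) computes the cost $\Phi(C, X)$ and samples centers via $D^2$-sampling, i.e. the k-means++ seeding rule in which each new center is drawn with probability proportional to its squared distance to the nearest center chosen so far. The function names are illustrative; this is a sketch of the standard technique, not the paper's implementation.

```python
import numpy as np

def kmeans_cost(X, C):
    """Phi(C, X): sum over points x in X of the squared distance to the nearest center in C."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # pairwise squared distances
    return d2.min(axis=1).sum()

def d2_sampling(X, k, seed=None):
    """Pick k centers from X by D^2-sampling (k-means++ seeding)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center chosen uniformly at random
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance of each point to its nearest current center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        probs = d2 / d2.sum()       # P(x) proportional to D^2(x)
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)
```

With well-separated clusters, the second sampled center lands in the far cluster with high probability, so the sampled set's cost $\Phi(S, X)$ drops sharply compared to a single center.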
