Abstract

Classical center-based clustering problems such as k-means/median/center assume that the optimal clusters satisfy a locality property: points in the same cluster are close to each other. A number of clustering problems arise in machine learning where the optimal clusters do not satisfy such a locality property. For instance, consider the r-gather clustering problem, where there is an additional constraint that each cluster should have at least r points, or the capacitated clustering problem, where there is an upper bound on the cluster sizes. Consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters $O_1, \dots, O_k$ are an arbitrary partition of the dataset, and the goal is to output k centers $c_1, \dots, c_k$ such that the objective function $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$ is minimized. It is not difficult to argue that any algorithm that outputs a single set of k centers (without knowing the optimal clusters) cannot do well with respect to this objective function. However, this does not rule out the existence of algorithms that output a list of such sets of k centers such that at least one set in the list behaves well. Given an error parameter $\varepsilon > 0$, let $\ell$ denote the size of the smallest list of sets of k centers such that at least one set gives a $(1 + \varepsilon)$-approximation with respect to the objective function above. In this paper, we show an upper bound on $\ell$ by giving a randomized algorithm that outputs a list of $2^{\tilde{O}(k/\varepsilon)}$ sets of k centers. We also give a closely matching lower bound of $2^{\tilde{\Omega}(k/\sqrt{\varepsilon})}$. Moreover, our algorithm runs in time $O\left(nd \cdot 2^{\tilde{O}(k/\varepsilon)}\right)$. This is a significant improvement over the previous result of Ding and Xu (2015), who gave an algorithm with running time $O(nd \cdot (\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$ that outputs a list of size $O((\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$. Our techniques generalize to the k-median problem and to many other settings where non-Euclidean distance measures are involved.
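To make the objective concrete, here is a minimal Python sketch (our illustration, not code from the paper; all function names are hypothetical) that evaluates the constrained cost $\sum_{i=1}^{k} \sum_{x \in O_i} \|x - c_i\|^2$ for a fixed partition, and then selects the best candidate from a list of k-center sets, mirroring the list paradigm described above.

```python
import numpy as np

def constrained_cost(partition, centers):
    """Sum of squared distances from each point to its own cluster's center.

    partition: list of k arrays, partition[i] holding the points of O_i
               (each array has shape (n_i, d)).
    centers:   array of shape (k, d); centers[i] serves cluster O_i.
    """
    return sum(
        np.sum(np.linalg.norm(cluster - c, axis=1) ** 2)
        for cluster, c in zip(partition, centers)
    )

def best_from_list(partition, candidate_center_sets):
    """Pick the candidate set of k centers minimizing the constrained cost."""
    return min(candidate_center_sets,
               key=lambda cs: constrained_cost(partition, cs))
```

Note that, unlike in the standard k-means problem, a point here pays the distance to its cluster's designated center rather than to its nearest center.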

Highlights

  • Clustering problems aim to classify high-dimensional data based on the proximity of points to each other

  • We model such problems by the notion of a center-based clustering problem

  • The k-means problem is defined in the following manner: given a dataset $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^d$ and an integer k, output a set of k centers $\{c_1, \dots, c_k\} \subset \mathbb{R}^d$ such that the objective function $\sum_{x \in X} \min_{c \in \{c_1, \dots, c_k\}} \|x - c\|^2$ is minimized (see the sketch after this list)
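For contrast with the constrained objective above, the following sketch (again our illustration, not code from the paper) computes the standard k-means cost, where each point is free to pay its squared distance to the nearest center.

```python
import numpy as np

def kmeans_cost(X, centers):
    """Standard k-means objective.

    X:       array of shape (n, d).
    centers: array of shape (k, d).
    """
    # Pairwise squared distances, shape (n, k); each point takes the
    # minimum over centers, and costs are summed over all points.
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return d2.min(axis=1).sum()
```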


Summary

Introduction

Clustering problems aim to classify high-dimensional data based on the proximity of points to each other. Ding and Xu (2015) studied a family of constrained clustering problems, including the so-called r-gather k-means, r-capacity k-means, and l-diversity k-means problems. Their approach to solving such problems was to output a list of candidate sets of centers (each of size k) such that at least one of these sets is close to the optimal centers. We formalize this approach and show that if k is small, one can obtain a PTAS for the constrained k-means (and the constrained k-median) problem whose running time is linear plus a constant number of calls to AC, an algorithm that computes the best constraint-respecting partition for a given set of k centers. As a corollary of our main result, we obtain efficient algorithms for the constrained k-means (and the constrained k-median) problems. A generic sketch of this list-based scheme appears below.
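The following Python sketch is a hypothetical rendering of the list paradigm, not the paper's exact algorithm: for each candidate set of k centers, a partition algorithm standing in for AC produces the best feasible clustering, and the cheapest (centers, partition) pair is returned. All names here are our own.

```python
import numpy as np

def constrained_kmeans_via_list(X, candidate_center_sets, partition_alg):
    """Generic list-based scheme (a sketch under stated assumptions).

    X:                     array of shape (n, d).
    candidate_center_sets: iterable of (k, d) arrays, e.g. the list of
                           size 2^{O~(k/eps)} produced by the algorithm.
    partition_alg:         stands in for AC; given (X, centers) it returns
                           a list of k index arrays forming the best
                           constraint-respecting partition of X.
    """
    best_cost, best_solution = np.inf, None
    for centers in candidate_center_sets:
        clusters = partition_alg(X, centers)  # feasible partition for these centers
        cost = sum(
            np.sum(np.linalg.norm(X[idx] - c, axis=1) ** 2)
            for idx, c in zip(clusters, centers)
        )
        if cost < best_cost:
            best_cost, best_solution = cost, (centers, clusters)
    return best_cost, best_solution
```

The point of the scheme is the separation of concerns: the list generation is oblivious to the constraint, and all constraint-specific work is delegated to the partition algorithm.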

Related Work
Preliminaries
Our Results
Our Techniques
The Algorithm
Analysis
Case I
Case II
