Approximation Schemes for Clustering with Outliers

Zachary Friggstad,Kamyar Khodamoradi,Mohammad R Salavatipour,Mohsen Rezapour

doi:10.1145/3301446

Abstract

Clustering problems are well studied in a variety of fields, such as data science, operations research, and computer science. Such problems include variants of center location problems, k -median and k -means to name a few. In some cases, not all data points need to be clustered; some may be discarded for various reasons. For instance, some points may arise from noise in a dataset or one might be willing to discard a certain fraction of the points to avoid incurring unnecessary overhead in the cost of a clustering solution. We study clustering problems with outliers. More specifically, we look at uncapacitated facility location (UFL), k - median , and k - means . In these problems, we are given a set X of data points in a metric space δ(., .), a set C of possible centers (each maybe with an opening cost), maybe an integer parameter k , plus an additional parameter z as the number of outliers. In uncapacitated facility location with outliers, we have to open some centers, discard up to z points of X , and assign every other point to the nearest open center, minimizing the total assignment cost plus center opening costs. In k - median and k - means , we have to open up to k centers, but there are no opening costs. In k - means , the cost of assigning j to i is δ 2 ( j , i ). We present several results. Our main focus is on cases where δ is a doubling metric (this includes fixed dimensional Euclidean metrics as a special case) or is the shortest path metrics of graphs from a minor-closed family of graphs. For uniform-cost UFL with outliers on such metrics, we show that a multiswap simple local search heuristic yields a PTAS. With a bit more work, we extend this to bicriteria approximations for the k - median and k - means problems in the same metrics where, for any constant ϵ > 0, we can find a solution using (1 + ϵ) k centers whose cost is at most a (1 + ϵ)-factor of the optimum and uses at most z outliers. Our algorithms are all based on natural multiswap local search heuristics. We also show that natural local search heuristics that do not violate the number of clusters and outliers for k - median (or k - means ) will have unbounded gap even in Euclidean metrics. Furthermore, we show how our analysis can be extended to general metrics for k - means with outliers to obtain a (25 + ϵ, 1 + ϵ)-approximation: an algorithm that uses at most (1 + ϵ) k clusters and whose cost is at most 25 + ϵ of optimum and uses no more than z outliers.

Full Text