Coreset Construction Research Articles

The min-sum k -clustering problem is to partition a metric space (P,d) into k clusters C 1,…,C k ⊆P such that $\sum_{i=1}^{k}\sum_{p,q\in C_{i}}d(p,q)$is minimized. We show the first efficient construction of a coreset for this problem. Our coreset construction is based on a new adaptive sampling algorithm. With our construction of coresets we obtain two main algorithmic results. The first result is a sublinear-time (4+e)-approximation algorithm for the min-sum k-clustering problem in metric spaces. The running time of this algorithm is $\widetilde{{\mathcal{O}}}(n)$for any constant k and e, and it is o(n 2) for all k=o(log n/log log n). Since the full description size of the input is Θ(n 2), this is sublinear in the input size. The fastest previously known o(log n)-factor approximation algorithm for k>2 achieved a running time of Ω(n k ), and no non-trivial o(n 2)-time algorithm was known before. Our second result is the first pass-efficient data streaming algorithm for min-sum k-clustering in the distance oracle model, i.e., an algorithm that uses poly(log n,k) space and makes 2 passes over the input point set, which arrives in form of a data stream in arbitrary order. It computes an implicit representation of a clustering of (P,d) with cost at most a constant factor larger than that of an optimal partition. Using one further pass, we can assign each point to its corresponding cluster. To develop the coresets, we introduce the concept of α -preserving metric embeddings. Such an embedding satisfies properties that the distance between any pair of points does not decrease and the cost of an optimal solution for the considered problem on input (P,d′) is within a constant factor of the optimal solution on input (P,d). In other words, the goal is to find a metric embedding into a (structurally simpler) metric space that approximates the original metric up to a factor of α with respect to a given problem. We believe that this concept is an interesting generalization of coresets.

In this paper we develop an efficient implementation for a k-means clustering algorithm. The algorithm is based on a combination of Lloyd's algorithm with random swapping of centers to avoid local minima. This approach was proposed by Mount 30. The novel feature of our algorithms is the use of coresets to speed up the algorithm. A coreset is a small weighted set of points that approximates the original point set with respect to the considered problem. We use a coreset construction described in 12. Our algorithm first computes a solution on a very small coreset. Then in each iteration the previous solution is used as a starting solution on a refined, i.e. larger, coreset. To evaluate the performance of our algorithm we compare it with algorithm KMHybrid 30 on typical 3D data sets for an image compression application and on artificially created instances. Our data sets consist of 300,000 to 4.9 million points. Our algorithm outperforms KMHybrid on most of these input instances. Additionally, the quality of the solutions computed by our algorithm deviates significantly less than that of KMHybrid. We conclude that the use of coresets has two effects. First, it can speed up algorithms significantly. Secondly, in variants of Lloyd's algorithm, it reduces the dependency on the starting solution and thus makes the algorithm more stable. Finally, we propose the use of coresets as a heuristic to approximate the average silhouette coefficient of clusterings. The average silhouette coefficient is a measure for the quality of a clustering that is independent of the number of clusters k. Hence, it can be used to compare the quality of clusterings for different sizes of k. To show the applicability of our approach we computed clusterings and approximate average silhouette coefficient for k = 1,…, 100 for our input instances and discuss the performance of our algorithm in detail.

Coreset Construction Research Articles

Articles published on Coreset Construction

Small Space Representations for Metric Min-sum k-Clustering and Their Applications

A FAST k-MEANS IMPLEMENTATION USING CORESETS

Faster core-set constructions and data-stream algorithms in fixed dimensions

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Coreset Construction Research Articles

Articles published on Coreset Construction

Small Space Representations for Metric Min-sum k-Clustering and Their Applications

A FAST k-MEANS IMPLEMENTATION USING CORESETS

Faster core-set constructions and data-stream algorithms in fixed dimensions