Abstract

Let P be a set of n points in R^d, let k ≥ 1 be an integer, and let ε ∈ (0, 1) be a constant. An ε-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates any given set Q ⊆ R^d of k centers: the sum of squared distances from every point in P to its closest point in Q is the same, up to a factor of 1 ± ε, as the corresponding weighted sum over C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to lie in P or in other restricted zones) on the small coreset to obtain faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic, and distributed data using the classic merge-and-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing coresets of this kind are randomized, and their size has at least a linear dependency on the dimension d. We suggest the first such coreset whose size is independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia, computed using Amazon’s cloud.
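In symbols, the guarantee sketched above can be written as follows; this is a standard formalization of an ε-coreset for k-means, with w(c) denoting the non-negative weight of a coreset point c:

\[
(1-\varepsilon)\sum_{p\in P}\min_{q\in Q}\lVert p-q\rVert^{2}
\;\le\;
\sum_{c\in C} w(c)\,\min_{q\in Q}\lVert c-q\rVert^{2}
\;\le\;
(1+\varepsilon)\sum_{p\in P}\min_{q\in Q}\lVert p-q\rVert^{2}
\qquad\text{for every } Q\subseteq\mathbb{R}^{d},\ |Q|=k.
\]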

Highlights

  • Given a set of n points in R^d and an error parameter ε > 0, a coreset in this paper is a small set of weighted points in R^d such that the sum of squared distances from the original set of points to any set of k centers in R^d can be approximated by the sum of weighted squared distances from the points in the coreset (a short usage sketch follows this list).

  • Note that the coreset guarantees are preserved under this technique, and no assumptions are made on the order of the streaming input points.

  • We proved that any set of points in R^d has a (k, ε)-coreset consisting of a weighted subset of the input points whose size is independent of n and d, and polynomial in 1/ε.
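To make the weighted guarantee concrete, here is a minimal Python sketch of running k-means on a coreset instead of on the full input. The coreset C and the weights w are assumed to come from any (k, ε)-coreset construction (such as the one in this paper; the construction itself is not reproduced here), and scikit-learn's KMeans with its sample_weight parameter stands in for a weighted k-means solver:

    import numpy as np
    from sklearn.cluster import KMeans

    def weighted_cost(points, weights, centers):
        """Sum over the points of weight times squared distance to the nearest center."""
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (m, k)
        return float((weights * d2.min(axis=1)).sum())

    def kmeans_on_coreset(C, w, k, seed=0):
        """Run weighted k-means on the (small) coreset C with weights w."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        km.fit(C, sample_weight=w)  # weighted Lloyd iterations on m << n points
        return km.cluster_centers_

By the coreset property, the weighted cost of the returned centers on C is within a 1 ± ε factor of their cost on the full set P, so centers that are good for the coreset are provably good for the original data.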

Summary

Background

Given a set of n points in R^d and an error parameter ε > 0, a coreset in this paper is a small set of weighted points in R^d such that the sum of squared distances from the original set of points to any set of k centers in R^d can be approximated by the sum of weighted squared distances from the points in the coreset. A coreset is a natural tool for handling Big Data in all the computation models mentioned in the previous section. This is mainly due to the merge-and-reduce tree approach that was suggested by [2,3] and formalized by [4]: coresets can be computed independently for subsets of the input points, e.g., on different computers, and then be merged and re-compressed again. The storage is linear in n, since we need to save the tree in memory (practically, on the hard drive); the update time is only logarithmic in n, since we need to reconstruct only the O(log n) coresets that correspond to the deleted/inserted point along the tree. The first such coreset of size independent of d was introduced by [6]. Full open-source code is available [8].
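The following Python sketch illustrates this merge-and-reduce pattern under stated assumptions: reduce_to_coreset is a placeholder for any (k, ε)-coreset construction (e.g., the deterministic one in this paper), the input arrives as fixed-size chunks, and the scaling of ε across the O(log n) tree levels, which the formal analysis handles so that the final error remains ε, is elided for brevity:

    import numpy as np

    def reduce_to_coreset(points, weights, k, eps):
        """Placeholder: any (k, eps)-coreset construction, returning (C, w)."""
        raise NotImplementedError

    def stream_coreset(chunks, k, eps):
        """Merge-and-reduce over a stream of (m, d) point chunks.
        Keeps one coreset per tree level, i.e., O(log n) coresets in memory."""
        levels = {}  # tree level -> (coreset points, weights)
        for chunk in chunks:
            C, w = reduce_to_coreset(chunk, np.ones(len(chunk)), k, eps)
            level = 0
            # As with carries in a binary counter: merge equal-level coresets
            # and re-compress the merged pair before moving up a level.
            while level in levels:
                C2, w2 = levels.pop(level)
                C, w = reduce_to_coreset(np.vstack([C, C2]),
                                         np.concatenate([w, w2]), k, eps)
                level += 1
            levels[level] = (C, w)
        Cs, ws = zip(*levels.values())  # final coreset: union over surviving levels
        return np.vstack(Cs), np.concatenate(ws)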

Related Work
Our Contribution
Solving k-Means Using k-Means
Running Time
Notation and Main Result
Coreset
Sparse Coresets
Coreset Construction
Proof of Correctness
Comparison to Existing Approaches
Datasets
The Experiment
On the Wikipedia Dataset
Results
Conclusions