Abstract

Let P be a set of n points in R^d, let k ≥ 1 be an integer, and let ε ∈ (0, 1) be a constant. An ε-coreset is a subset C ⊆ P with appropriate non-negative weights (scalars) that approximates any given set Q ⊆ R^d of k centers: the sum of squared distances from every point in P to its closest point in Q is the same, up to a factor of 1 ± ε, as the corresponding weighted sum over C to the same k centers. If the coreset is small, we can solve problems such as k-means clustering or its variants (e.g., discrete k-means, where the centers are restricted to lie in P or in other restricted zones) on the small coreset to obtain faster provable approximations. Moreover, it is known that such coresets support streaming, dynamic, and distributed data using the classic merge-and-reduce trees. The fact that the coreset is a subset implies that it preserves the sparsity of the data. However, existing coresets of this kind are randomized, and their size has at least a linear dependency on the dimension d. We suggest the first such coreset whose size is independent of d. This is also the first deterministic coreset construction whose resulting size is not exponential in d. Extensive experimental results and benchmarks are provided on public datasets, including the first coreset of the English Wikipedia, computed using Amazon’s cloud.
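In symbols, the guarantee sketched above can be written as follows; this is a standard formalization of an ε-coreset for k-means, with w(c) denoting the non-negative weight of a coreset point c:

\[
(1-\varepsilon)\sum_{p\in P}\min_{q\in Q}\lVert p-q\rVert^{2}
\;\le\;
\sum_{c\in C} w(c)\,\min_{q\in Q}\lVert c-q\rVert^{2}
\;\le\;
(1+\varepsilon)\sum_{p\in P}\min_{q\in Q}\lVert p-q\rVert^{2}
\qquad\text{for every } Q\subseteq\mathbb{R}^{d},\ |Q|=k.
\]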

Highlights

  • Given a set of n points in R^d and an error parameter ε > 0, a coreset in this paper is a small set of weighted points in R^d such that the sum of squared distances from the original set of points to any set of k centers in R^d can be approximated by the sum of weighted squared distances from the points in the coreset (a short usage sketch follows this list).

  • Note that the coreset guarantees are preserved under this technique, and no assumptions are made on the order of the streaming input points.

  • We proved that any set of points in R^d has a (k, ε)-coreset consisting of a weighted subset of the input points whose size is independent of n and d, and polynomial in 1/ε.
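To make the weighted guarantee concrete, here is a minimal Python sketch of running k-means on a coreset instead of on the full input. The coreset C and the weights w are assumed to come from any (k, ε)-coreset construction (such as the one in this paper; the construction itself is not reproduced here), and scikit-learn's KMeans with its sample_weight parameter stands in for a weighted k-means solver:

    import numpy as np
    from sklearn.cluster import KMeans

    def weighted_cost(points, weights, centers):
        """Sum over the points of weight times squared distance to the nearest center."""
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (m, k)
        return float((weights * d2.min(axis=1)).sum())

    def kmeans_on_coreset(C, w, k, seed=0):
        """Run weighted k-means on the (small) coreset C with weights w."""
        km = KMeans(n_clusters=k, n_init=10, random_state=seed)
        km.fit(C, sample_weight=w)  # weighted Lloyd iterations on m << n points
        return km.cluster_centers_

By the coreset property, the weighted cost of the returned centers on C is within a 1 ± ε factor of their cost on the full set P, so centers that are good for the coreset are provably good for the original data.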

Summary

Background

Given a set of n points in R^d and an error parameter ε > 0, a coreset in this paper is a small set of weighted points in R^d such that the sum of squared distances from the original set of points to any set of k centers in R^d can be approximated by the sum of weighted squared distances from the points in the coreset. A coreset is a natural tool for handling Big Data in all the computation models mentioned in the previous section. This is mainly due to the merge-and-reduce tree approach that was suggested by [2,3] and formalized by [4]: coresets can be computed independently for subsets of the input points, e.g., on different computers, and then be merged and re-compressed again. The storage is linear in n, since we need to save the tree in memory (practically, on the hard drive); the update time is only logarithmic in n, since we need to reconstruct only the O(log n) coresets that correspond to the deleted/inserted point along the tree. The first such coreset of size independent of d was introduced by [6]. Full open-source code is available [8].
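The following Python sketch illustrates this merge-and-reduce pattern under stated assumptions: reduce_to_coreset is a placeholder for any (k, ε)-coreset construction (e.g., the deterministic one in this paper), the input arrives as fixed-size chunks, and the scaling of ε across the O(log n) tree levels, which the formal analysis handles so that the final error remains ε, is elided for brevity:

    import numpy as np

    def reduce_to_coreset(points, weights, k, eps):
        """Placeholder: any (k, eps)-coreset construction, returning (C, w)."""
        raise NotImplementedError

    def stream_coreset(chunks, k, eps):
        """Merge-and-reduce over a stream of (m, d) point chunks.
        Keeps one coreset per tree level, i.e., O(log n) coresets in memory."""
        levels = {}  # tree level -> (coreset points, weights)
        for chunk in chunks:
            C, w = reduce_to_coreset(chunk, np.ones(len(chunk)), k, eps)
            level = 0
            # As with carries in a binary counter: merge equal-level coresets
            # and re-compress the merged pair before moving up a level.
            while level in levels:
                C2, w2 = levels.pop(level)
                C, w = reduce_to_coreset(np.vstack([C, C2]),
                                         np.concatenate([w, w2]), k, eps)
                level += 1
            levels[level] = (C, w)
        Cs, ws = zip(*levels.values())  # final coreset: union over surviving levels
        return np.vstack(Cs), np.concatenate(ws)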

Related Work
Our Contribution
Solving k-Means Using k-Means
Running Time
Notation and Main Result
Coreset
Sparse Coresets
Coreset Construction
Proof of Correctness
Comparison to Existing Approaches
Datasets
The Experiment
On the Wikipedia Dataset
Results
Conclusions