S-RASTER: contraction clustering for evolving data streams

Gregor Ulm, Adrian Nilsson, Simon Smith, Emil Gustavsson, Mats Jirstrand

doi:10.1186/s40537-020-00336-3

Open Access

Abstract

Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters. It also exhibits good scalability in the presence of multiple CPU cores. RASTER exhibits very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. Yet, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. This algorithm retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades off an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.
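The sliding-window idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `SlidingTileCounts`, its method names, the integer time steps, and the way pruning is triggered are all assumptions made for this illustration. It keeps one counter per time step, so that a whole step can be dropped at once when it leaves the window, and memory depends on the number of occupied tiles rather than on the number of points seen.

```python
from collections import Counter, deque


class SlidingTileCounts:
    """Hypothetical helper: per-tile point counts over the last `window` time steps."""

    def __init__(self, window, tau):
        self.window = window      # number of time steps kept
        self.tau = tau            # significance threshold
        self.steps = deque()      # (time_step, Counter of tile -> count)
        self.totals = Counter()   # tile -> count within the current window

    def add(self, time_step, tile):
        """Record one projected point; time steps must be non-decreasing."""
        if not self.steps or self.steps[-1][0] != time_step:
            self.steps.append((time_step, Counter()))
            self._prune(time_step)
        self.steps[-1][1][tile] += 1
        self.totals[tile] += 1

    def _prune(self, now):
        """Drop every time step that has fallen out of the window."""
        while self.steps and self.steps[0][0] <= now - self.window:
            _, old = self.steps.popleft()
            for tile, n in old.items():
                self.totals[tile] -= n
                if self.totals[tile] == 0:
                    del self.totals[tile]

    def significant(self):
        """Tiles holding more than tau points in the current window."""
        return {t for t, n in self.totals.items() if n > self.tau}
```

In this sketch, clustering would be run periodically on the result of `significant()`, which matches the scenario the abstract describes, where clustering happens periodically rather than continually.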

Highlights

Clustering is a standard method for data analysis, and many clustering methods have been proposed [29].
Tiles to which more than a predefined threshold number τ of data points have been projected are retained. These are referred to as significant tiles σ, which are subsequently clustered by exhaustive lookup of neighboring tiles in a depth-first manner.
The key takeaway of our original work on RASTER was that, by carefully chosen trade-offs, we are able to process geospatial big data on a local workstation.
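As a rough illustration of the projection and thresholding step mentioned in the highlights, the following Python sketch counts points per tile in a single pass and retains the tiles whose count exceeds τ. The function name `significant_tiles` and the `precision` parameter (here, the number of decimal places kept when mapping a coordinate to a tile index) are assumptions made for this example and are not taken from the paper.

```python
from collections import Counter
import math


def significant_tiles(points, precision, tau):
    """Project 2D points onto a grid and keep tiles with more than tau points.

    `precision` is a hypothetical parameter: the number of decimal places
    kept when a coordinate is truncated to a tile index.
    """
    scale = 10 ** precision
    # Single pass over the input: each point increments one tile counter.
    counts = Counter(
        (math.floor(x * scale), math.floor(y * scale)) for x, y in points
    )
    return {tile for tile, n in counts.items() if n > tau}


points = [(0.11, 0.12), (0.13, 0.14), (0.19, 0.18), (0.51, 0.52)]
tiles = significant_tiles(points, precision=1, tau=2)
```

Only the per-tile counts are stored, never the points themselves, so memory grows with the number of occupied tiles and not with the size of the input.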

Summary

Introduction

Clustering is a standard method for data analysis, and many clustering methods have been proposed [29]. Some of the most well-known clustering algorithms are DBSCAN [9], k-means clustering [23], and CLIQUE [1, 2]. They have in common that they do not perform well with big data, i.e. data that far exceeds available main memory [34]. We previously described RASTER and highlighted its performance for sequential processing of batch data [32]. This was followed by a description of a parallel version of that algorithm [33]. To form clusters, the algorithm selects an arbitrary tile as the seed of a new cluster. This cluster is grown iteratively by looking up all neighboring tiles within a given Manhattan or Chebyshev distance δ.
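The cluster-growing step described above can be sketched in Python as follows. This is an illustrative reading of the description, not the authors' code: the function name `cluster_tiles`, the representation of tiles as integer `(x, y)` pairs, and the `metric` argument are assumptions made for this example.

```python
def cluster_tiles(tiles, delta=1, metric="chebyshev"):
    """Group tiles into clusters by depth-first lookup of neighboring tiles."""
    # Precompute the offsets of all tiles within distance delta of a tile.
    offsets = [
        (dx, dy)
        for dx in range(-delta, delta + 1)
        for dy in range(-delta, delta + 1)
        if (dx, dy) != (0, 0)
        and (metric == "chebyshev" or abs(dx) + abs(dy) <= delta)
    ]
    remaining = set(tiles)
    clusters = []
    while remaining:
        seed = remaining.pop()            # an arbitrary tile seeds a new cluster
        cluster, stack = {seed}, [seed]
        while stack:                      # grow the cluster depth-first
            x, y = stack.pop()
            for dx, dy in offsets:
                neighbor = (x + dx, y + dy)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    cluster.add(neighbor)
                    stack.append(neighbor)
        clusters.append(cluster)
    return clusters


example = {(0, 0), (1, 1), (2, 1), (5, 5)}
by_chebyshev = cluster_tiles(example, delta=1, metric="chebyshev")
by_manhattan = cluster_tiles(example, delta=1, metric="manhattan")
```

With δ = 1, the Chebyshev variant treats diagonal tiles as neighbors, so `(0, 0)` and `(1, 1)` end up in the same cluster, while the Manhattan variant separates them. Each tile is removed from `remaining` exactly once, so the work is linear in the number of tiles for a fixed δ.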

Journal: Journal of Big Data
Publication Date: Aug 13, 2020
Citations: 1
License type: open-access
