Abstract
Contraction Clustering (RASTER) is a single-pass algorithm for density-based clustering of 2D data. It can process arbitrary amounts of data in linear time and in constant memory, quickly identifying approximate clusters, and it scales well across multiple CPU cores. RASTER achieves very competitive performance compared to standard clustering algorithms, but at the cost of decreased precision. However, RASTER is limited to batch processing and unable to identify clusters that only exist temporarily. In contrast, S-RASTER is an adaptation of RASTER to the stream processing paradigm that is able to identify clusters in evolving data streams. It retains the main benefits of its parent algorithm, i.e. single-pass linear time cost and constant memory requirements for each discrete time step within a sliding window. The sliding window is efficiently pruned, and clustering is still performed in linear time. Like RASTER, S-RASTER trades an often negligible amount of precision for speed. Our evaluation shows that competing algorithms are at least 50% slower. Furthermore, S-RASTER shows good qualitative results, based on standard metrics. It is very well suited to real-world scenarios where clustering does not happen continually but only periodically.
Highlights
Clustering is a standard method for data analysis and many clustering methods have been proposed [29].
Tiles onto which more than a predefined threshold number τ of data points have been projected are retained. These are referred to as significant tiles σ, which are subsequently clustered by exhaustive lookup of neighboring tiles in a depth-first manner (see the sketch after this list).
The key takeaway of our original work on RASTER was that, with carefully chosen trade-offs, we are able to process geospatial big data on a local workstation.
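The tile projection and retention step can be illustrated with a minimal Python sketch. It assumes that a tile is obtained by scaling each coordinate by a precision factor and flooring the result; the names significant_tiles, precision, and tau are illustrative and not part of the published implementation.

```python
import math
from collections import defaultdict

def significant_tiles(points, precision, tau):
    """Project 2D points onto grid tiles and retain tiles onto which
    more than tau points were projected (the significant tiles).

    Assumption: a tile is identified by scaling each coordinate by a
    precision factor and flooring the result.
    """
    counts = defaultdict(int)
    for x, y in points:
        tile = (math.floor(x * precision), math.floor(y * precision))
        counts[tile] += 1
    return {tile for tile, c in counts.items() if c > tau}
```

Because each point is touched exactly once and only per-tile counts are kept, this step runs in a single pass over the data, in line with the properties claimed above.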
Summary
Clustering is a standard method for data analysis and many clustering methods have been proposed [29]. Some of the most well-known clustering algorithms are DBSCAN [9], k-means clustering [23], and CLIQUE [1, 2]. They have in common that they do not perform well with big data, i.e. data that far exceeds available main memory [34]. We previously described RASTER and highlighted its performance for sequential processing of batch data [32]. This was followed by a description of a parallel version of that algorithm [33]. The algorithm selects an arbitrary tile as the seed of a new cluster. This cluster is grown iteratively by looking up all neighboring tiles within a given Manhattan or Chebyshev distance δ.
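The cluster-growing step just described admits a short sketch as well. The version below assumes the Chebyshev metric and represents significant tiles as a set of integer coordinate pairs; grow_clusters and its parameters are illustrative names.

```python
def grow_clusters(sigma, delta=1):
    """Cluster significant tiles: pick an arbitrary tile as the seed of
    a new cluster, then grow it depth-first over all neighboring tiles
    within Chebyshev distance delta (the Manhattan distance is the
    other option mentioned in the text)."""
    remaining = set(sigma)
    clusters = []
    while remaining:
        seed = remaining.pop()           # arbitrary seed tile
        stack, cluster = [seed], {seed}
        while stack:
            tx, ty = stack.pop()
            # Enumerate all tiles within Chebyshev distance delta.
            for dx in range(-delta, delta + 1):
                for dy in range(-delta, delta + 1):
                    neighbor = (tx + dx, ty + dy)
                    if neighbor in remaining:
                        remaining.remove(neighbor)
                        cluster.add(neighbor)
                        stack.append(neighbor)
        clusters.append(cluster)
    return clusters
```

With δ = 1 under the Chebyshev metric, each tile is connected to its eight surrounding tiles; under the Manhattan metric the four diagonal neighbors would be excluded.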