Hierarchical Density-Based Clustering Using MapReduce

Joelson Antonio Dos Santos, Murilo C Naldi, Joerg Sander, Talat Iqbal Syed, Ricardo J G B Campello

doi:10.1109/tbdata.2019.2907624

Abstract

Hierarchical density-based clustering is a powerful tool for exploratory data analysis, which can play an important role in the understanding and organization of datasets. However, its applicability to large datasets is limited because the computational complexity of hierarchical clustering methods has a quadratic lower bound in the number of objects to be clustered. MapReduce is a popular programming model to speed up data mining and machine learning algorithms operating on large, possibly distributed datasets. In the literature, there have been attempts to parallelize algorithms such as Single-Linkage, which in principle can also be extended to the broader scope of hierarchical density-based clustering, but hierarchical clustering algorithms are inherently difficult to parallelize with MapReduce. In this paper, we discuss why adapting previous approaches to parallelize Single-Linkage clustering using MapReduce leads to very inefficient solutions when one wants to compute density-based clustering hierarchies. Preliminarily, we discuss one such solution, which is based on an exact, yet very computationally demanding, random blocks parallelization scheme. To be able to efficiently apply hierarchical density-based clustering to large datasets using MapReduce, we then propose a different parallelization scheme that computes an approximate clustering hierarchy based on a much faster, recursive sampling approach. This approach is based on HDBSCAN*, the state-of-the-art hierarchical density-based clustering algorithm, combined with a data summarization technique called data bubbles. The proposed method is evaluated in terms of both runtime and quality of the approximation on a number of datasets, showing its effectiveness and scalability.
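The quadratic cost mentioned in the abstract can be made concrete with a minimal, single-machine sketch of the standard HDBSCAN* building blocks: core distances, mutual reachability distances, and a minimum spanning tree over the complete mutual reachability graph. This is not the MapReduce scheme or the data bubbles summarization proposed in the paper; the function names (`core_distances`, `mutual_reachability`, `mst_edges`) and the use of Euclidean distance are assumptions made for illustration only.

```python
import math


def core_distances(points, min_pts):
    """Core distance of each point: distance to its min_pts-th nearest
    neighbor, counting the point itself as its own first neighbor."""
    result = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        result.append(dists[min_pts - 1])
    return result


def mutual_reachability(points, core, i, j):
    """Mutual reachability distance between points i and j."""
    return max(core[i], core[j], math.dist(points[i], points[j]))


def mst_edges(points, min_pts):
    """Prim's algorithm on the complete mutual reachability graph.

    Every pair of points is examined, so the work grows quadratically
    with the number of points. Returns (u, v, weight) edges sorted by weight.
    """
    n = len(points)
    core = core_distances(points, min_pts)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known edge weight into the tree
    parent = [-1] * n       # tree endpoint of that cheapest edge
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest point not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        # Relax the edges from u to all remaining points.
        for v in range(n):
            if not in_tree[v]:
                w = mutual_reachability(points, core, u, v)
                if w < best[v]:
                    best[v] = w
                    parent[v] = u
    return sorted(edges, key=lambda e: e[2])


if __name__ == "__main__":
    # Two small, well-separated groups (illustrative data, not from the paper).
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for u, v, w in mst_edges(data, min_pts=2):
        print(u, v, round(w, 3))
```

Removing the tree edges in decreasing order of weight yields the nested sequence of splits from which a density-based hierarchy is read off. The all-pairs loop in `mst_edges` is the part that becomes prohibitive on large datasets, which is the bottleneck the abstract refers to.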

Journal: IEEE Transactions on Big Data
Publication Date: Mar 1, 2021
Citations: 65

Similar Papers

An Effective and Efficient Constrained Ward’s Hierarchical Agglomerative Clustering Method
Abeer A Aljohani ... Eran A Edirisinghe
24 Aug 2019

An efficient hierarchical clustering model for grouping web transactions
Darenna Syahida Suib ... Mustafa Mat Deris
International Journal of Business Intelligence and Data Mining | VOL. 3
01 Jan 2008

A Kind of Hierarchical K-Means Web Log Clustering Algorithm
Li Xia Liu ... Yi Qi Zhuang
Key Engineering Materials | VOL. 439-440
01 Jun 2010

Performance Comparison with Hierarchical and Partitional Clustering Methods
Ozer Ozdemir ... Simgenur Cerman
WSEAS TRANSACTIONS ON COMMUNICATIONS | VOL. 20
27 Dec 2021
