Adaptive Density-Based Spatial Clustering for Massive Data Analysis

Zihao Cai, Jian Wang, Kejing He

doi:10.1109/access.2020.2969440

Abstract

Clustering is a classical research field due to its broad applications in data mining, such as emotion detection, event extraction, and topic discovery. It aims to discover intrinsic patterns, which can be formed as clusters, from a collection of data. Significant progress has been made by Density-based Spatial Clustering of Applications with Noise (DBSCAN) and its variants. However, current density-based algorithms have a major limitation: they suffer from the linear connection problem, in which they fail to separate the intended clusters when those clusters are “connected” by a few data points. Moreover, parameter setting and time cost make these algorithms hard to adapt to massive data analysis. To address these problems, we propose a novel adaptive density-based spatial clustering algorithm called Ada-DBSCAN, which consists of a data block splitter and a data block merger, coordinated by local clustering and global clustering. We conduct extensive experiments on both artificial and real-world datasets to evaluate the effectiveness of Ada-DBSCAN. Experimental results show that our algorithm clearly outperforms several strong baselines in both clustering accuracy and human evaluation. In addition, Ada-DBSCAN is significantly more efficient than DBSCAN.

Highlights

With the growth of large data collections in various domains such as business management and cloud services [1], much attention has been given to data mining algorithms, which can be applied to many tasks such as event detection [2], personalized recommendation [3] and the Internet of Things (IoT) [4]
To deal with the aforementioned limitations, we propose a novel Adaptive Density-based Spatial Clustering of Applications with Noise (Ada-DBSCAN), which consists of a data block splitter and a data block merger, coordinated by local clustering and global clustering
We propose Ada-DBSCAN, a novel adaptive density-based clustering algorithm, which leverages the idea of data splitting and data merging to dynamically discover clusters from local to global, making it well-suited for addressing the issue of linear connection

Summary

Introduction

With the growth of large data collections in various domains such as business management and cloud services [1], much attention has been given to data mining algorithms, which can be applied to many tasks such as event detection [2], personalized recommendation [3] and the Internet of Things (IoT) [4]. As a typical data mining technique, clustering has broad applications in data analysis, and density-based clustering algorithms play a crucial role in it [5]. An advantage of DBSCAN is that it requires no prior knowledge of the number of clusters, in contrast to distance-based clustering algorithms such as K-Means [8].
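The linear connection problem described in the abstract can be illustrated with a small sketch. The code below is a minimal textbook DBSCAN written for illustration only; it is not Ada-DBSCAN and not the authors' implementation. The toy data (two 3x3 grids joined by a thin bridge of points), the helper names `dbscan`, `grid`, and `n_clusters`, and the parameter values `eps=1.0` and `min_pts=3` are all assumptions chosen to make the effect visible. Note that the number of clusters is never passed in, which reflects the advantage over K-Means mentioned above.

```python
from math import dist  # Python 3.8+


def dbscan(points, eps, min_pts):
    """Minimal textbook DBSCAN. Returns one label per point (-1 = noise)."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force eps-neighborhood query; includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding
                queue.extend(nbrs)
    return labels


def grid(x0, y0, size=3):
    """Hypothetical helper: a size x size block of unit-spaced points."""
    return [(x0 + dx, y0 + dy) for dx in range(size) for dy in range(size)]


def n_clusters(labels):
    return len(set(labels) - {-1})


# Two well-separated dense blocks (assumed toy data).
blobs = grid(0, 0) + grid(8, 0)
# A thin line of points that "connects" the two blocks.
bridge = [(x, 1) for x in range(3, 8)]

separate = dbscan(blobs, eps=1.0, min_pts=3)
connected = dbscan(blobs + bridge, eps=1.0, min_pts=3)

print(n_clusters(separate))   # clusters found without the bridge
print(n_clusters(connected))  # clusters found with the bridge
```

With these assumed settings, the two blocks are found as separate clusters on their own, but adding the five bridge points makes every bridge point a core point, so density-reachability chains through the bridge and the two blocks collapse into one cluster.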

Methods

Results

Conclusion