Abstract

Data growth in today’s world is exponential: many applications, such as smart grids, sensor networks, video surveillance, financial systems, medical science, web click streams, and network monitoring, generate huge volumes of streaming data at very high speed. In traditional data mining, the data set is generally static and can be scanned many times for processing and analysis. Data stream mining, by contrast, has to satisfy constraints on real-time response, bounded memory, single-pass processing, and concept-drift detection. The main problem is identifying the hidden patterns and knowledge needed to understand context and detect trends in continuous data streams. In this paper, various data stream methods and algorithms are reviewed and evaluated on standard synthetic data streams and real-life data streams. Density-micro clustering and density-grid-based clustering algorithms are discussed, and a comparative analysis is performed in terms of various internal and external clustering evaluation methods. It was observed that no single algorithm satisfies all the performance measures. The performance of these data stream clustering algorithms is domain-specific, and the algorithms require many parameters for density and noise thresholds.

Highlights

  • Nowadays automation is in almost every domain and transactions of everyday life are recorded at high speed

  • The statistics for an individual cluster, such as centroid x0, radius R, and diameter D, are used recursively with a multiphase clustering technique. These phases are: Phase 1: BIRCH generates a multilevel Clustering Feature (CF)-tree that preserves the data’s inherent structure, compressing the data during the initial scan. Phase 2: A clustering algorithm is applied starting from the leaf nodes of the CF-tree; this removes sparse clusters as noise or outliers and groups dense nodes into clusters

  • The basic definitions in DBSCAN are introduced in the following, where D is a current set of data points: Basic Definition
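
The per-cluster statistics named in the BIRCH highlight (centroid x0, radius R, diameter D) can be maintained incrementally from a compact summary, which is what makes a single scan with bounded memory feasible. The following is a minimal Python sketch, assuming the commonly used clustering-feature triple (N, LS, SS): point count, per-dimension linear sum, and sum of squared norms. The class name ClusteringFeature and its method names are hypothetical and chosen for illustration; they are not taken from the paper.

```python
import math


class ClusteringFeature:
    """Hypothetical summary of one cluster as the triple (N, LS, SS)."""

    def __init__(self, dim):
        self.n = 0               # N: number of points absorbed so far
        self.ls = [0.0] * dim    # LS: per-dimension linear sum of the points
        self.ss = 0.0            # SS: sum of squared norms of the points

    def add(self, point):
        """Absorb one point in a single pass; the point itself is not stored."""
        self.n += 1
        for i, value in enumerate(point):
            self.ls[i] += value
        self.ss += sum(value * value for value in point)

    def merge(self, other):
        """Combine two summaries; CF triples are additive."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        """x0 = LS / N"""
        return [value / self.n for value in self.ls]

    def radius(self):
        """R = sqrt(SS / N - ||x0||^2): RMS distance of points to the centroid."""
        c = self.centroid()
        variance = self.ss / self.n - sum(value * value for value in c)
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error

    def diameter(self):
        """D = sqrt((2*N*SS - 2*||LS||^2) / (N*(N-1))): RMS pairwise distance."""
        if self.n < 2:
            return 0.0
        ls_sq = sum(value * value for value in self.ls)
        value = (2 * self.n * self.ss - 2 * ls_sq) / (self.n * (self.n - 1))
        return math.sqrt(max(value, 0.0))


cf = ClusteringFeature(dim=2)
for p in [(0.0, 0.0), (2.0, 0.0)]:
    cf.add(p)
print(cf.centroid(), cf.radius(), cf.diameter())
```

Because the triple is additive, two summaries can be merged without revisiting the raw points, which is the property a CF-tree relies on when it combines entries.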


Summary

Introduction

Nowadays, automation is present in almost every domain, and the transactions of everyday life are recorded at high speed. Some review papers discuss density-based clustering techniques on data streams [20]. A survey in a past paper [21] reviewed density-based clustering techniques and methods for evolving data streams. The authors surveyed clustering algorithms used in different domains and their applications on benchmark datasets and computational problems. They discussed many closely correlated topics, such as cluster validation and proximity measures. Their focus is on clustering techniques based on MapReduce and parallel classification using MapReduce. The authors of another past paper [22] used taxonomy and empirical analysis to survey clustering algorithms on big data.

Clustering Techniques
Partitional Clustering
Hierarchical Clustering
Density-Based
Grid-Based
Model-Based
Hierarchical Clustering Method
Density-Based Clustering Method
Grid-Based Clustering Method
Model-Based Clustering Methods
Evaluation Clustering Methods
Determining Number of Clusters in a Data Set
Measuring Clustering Quality
The Internal Measures for Evaluation of Clustering Quality
F Measure
The External Measure for Evaluation of Clustering Quality
Challenging Issues and Comparison
Experimentation with Data Streams
