CluSandra: A Framework and Algorithm for Data Stream Cluster Analysis

Josh R, Eman M

doi:10.14569/ijacsa.2011.021115

Abstract

The clustering or partitioning of a dataset’s records into groups of similar records is an important aspect of knowledge discovery from datasets. A considerable amount of research has been devoted to identifying clusters in very large, multi-dimensional, static datasets. However, the traditional clustering and pattern recognition algorithms that have resulted from this research are inefficient for clustering data streams. A data stream is a dynamic dataset characterized by a sequence of data records that evolves over time, arrives at extremely fast rates, and is unbounded. Today, the world abounds with processes that generate high-speed evolving data streams. Examples include click streams, credit card transactions, and sensor networks. The data stream’s inherent characteristics present an interesting set of time- and space-related challenges for clustering algorithms. In particular, processing time is severely constrained, and clustering algorithms must operate in a single pass over the incoming data. This paper presents both a clustering framework and an algorithm that, combined, address these challenges and allow end users to explore and gain knowledge from evolving data streams. Our approach integrates open source products that are used to control the data stream and to facilitate the harnessing of knowledge from it. Experimental results of testing the framework with various data streams are also discussed.

Highlights

According to the International Data Corporation (IDC), the size of the 2006 digital universe was 0.18 zettabytes, and the IDC has forecast a tenfold growth by 2011 to 1.8 zettabytes [17].
This paper describes the clustering algorithm and the distributed framework, which is entirely composed of off-the-shelf open source components.
When working with data records whose attributes are of this data type, the records can be treated as n-dimensional vectors, where the similarity or dissimilarity between individual vectors is quantified by a distance measure.
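
To make the last highlight concrete, here is a minimal sketch of one common distance measure between two records treated as n-dimensional vectors. Euclidean distance is an assumption made for illustration (the text above does not name the measure), and the function name `euclidean_distance` and the sample records are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the Euclidean distance between two n-dimensional records.

    Assumption: Euclidean distance is used only as an example of a
    distance measure; the source text does not specify which one applies.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two hypothetical 3-dimensional records.
record_a = [1.0, 2.0, 3.0]
record_b = [4.0, 6.0, 3.0]
distance = euclidean_distance(record_a, record_b)  # sqrt(9 + 16 + 0) = 5.0
```

A smaller distance indicates more similar records; a clustering algorithm would typically group records whose mutual distances are small.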

Summary

INTRODUCTION

According to the International Data Corporation (IDC), the size of the 2006 digital universe was 0.18 zettabytes, and the IDC has forecast a tenfold growth by 2011 to 1.8 zettabytes [17]. The unbounded and evolving nature of the data produced by the data stream, coupled with its varying and high-speed arrival rate, requires that the data stream clustering algorithm embrace these properties: efficiency, scalability, availability, and reliability. One of the objectives of this work is to produce a distributed framework that addresses these properties and facilitates the development of data stream clustering algorithms for this extreme environment. The combination of the CluSandra framework and algorithm provides a distributed, scalable, and highly available clustering system that operates efficiently within the severe temporal and spatial constraints associated with real-time evolving data streams. Through the use of such a system, end users can gain a deeper understanding of the data stream and its evolving nature, both in near-real time and over different time horizons.
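
The single-pass constraint described above can be illustrated with a small sketch. This is not the CluSandra algorithm; it is a generic example under stated assumptions, in which each record is read exactly once and either absorbed by the nearest cluster or used to start a new one. The names `MicroCluster`, `single_pass_cluster`, and `max_radius` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Sequence


@dataclass
class MicroCluster:
    """Hypothetical summary of a cluster: a record count and per-dimension sums."""
    count: int
    linear_sum: List[float]

    def centroid(self) -> List[float]:
        return [s / self.count for s in self.linear_sum]

    def absorb(self, record: Sequence[float]) -> None:
        # Update the summary in place; the raw record is not stored.
        self.count += 1
        for i, value in enumerate(record):
            self.linear_sum[i] += value


def single_pass_cluster(stream: Iterable[Sequence[float]],
                        max_radius: float) -> List[MicroCluster]:
    """Cluster a stream in one pass (illustrative only, not CluSandra)."""
    clusters: List[MicroCluster] = []
    for record in stream:  # each record is visited exactly once
        nearest = min(clusters,
                      key=lambda c: math.dist(c.centroid(), record),
                      default=None)
        if nearest is not None and math.dist(nearest.centroid(), record) <= max_radius:
            nearest.absorb(record)
        else:
            clusters.append(MicroCluster(1, list(record)))
    return clusters


# A generator can only be consumed once, which mirrors the single-pass constraint.
points = iter([(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5)])
result = single_pass_cluster(points, max_radius=2.0)
```

Because only counts and sums are kept, memory use depends on the number of clusters rather than on the length of the stream, which is the property an unbounded stream demands.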

Data Stream

Cluster Analysis

RELATED WORK

CluStream

CLUSANDRA FRAMEWORK

Timeline Index

Message Queuing System

Microclustering Agent

Superclusters and Macroclustering

The Spring Framework

CLUSTER QUERY LANGUAGE

EMPIRICAL RESULTS

Test Environment and Datasets

CONCLUSIONS AND FUTURE WORK

Journal: International Journal of Advanced Computer Science and Applications	Publication Date: Jan 1, 2011
Citations: 10	License type: cc-by
