RCD+: A Partitioning Method for Data Streams Based on Multiple Queries

Ruichang Li, Chunkai Wang, Honglei Zhu, Fan Liao

doi:10.1109/access.2020.2980554


Open Access

https://doi.org/10.1109/access.2020.2980554


Abstract

Big data stream management systems often must transform a query application into multiple query tasks while simultaneously and dynamically partitioning data streams by attribute values or partitioning keys. However, because partitioning keys may be applied in different orders or with different strategies, data streams are transmitted redundantly and repeatedly between nodes, which degrades system performance. In addition, as data skewness changes, data stream partitioning can remain unbalanced between processing units within the same node. This paper presents RCD+ (Runtime Correlation Discovery), a partitioning framework based on runtime correlation discovery. RCD+ implements a full granularity partitioning strategy that comprises runtime positive correlation partitioning (RPC-partitioning) and clustering partitioning (Clu-partitioning). First, in RPC-partitioning, we introduce a mini-batch scheme to reduce the number of output stream caches, and we partition data streams using the relevance of partitioning keys. Second, in Clu-partitioning, we re-partition data streams by clustering skewed data streams at both the inter-node and intra-node levels. We then construct a routing table to manage partition states and ensure the correctness of multiple query tasks. Finally, we implement the framework on Apache Storm. Experiments with synthetic and real data show that the proposed framework exhibits better query performance.
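To make the role of a routing table more concrete, the following is a minimal sketch, not the paper's implementation. It assumes a default hash-based assignment of partitioning keys to partitions, plus a table of overrides that records where a key has been moved after re-partitioning. The class name `RoutingTable` and its methods are hypothetical and chosen only for illustration.

```python
import zlib


class RoutingTable:
    """Hypothetical routing table: default hash routing plus per-key overrides."""

    def __init__(self, num_partitions):
        if num_partitions <= 0:
            raise ValueError("num_partitions must be positive")
        self.num_partitions = num_partitions
        # key -> partition; filled only for keys that were re-partitioned
        self.overrides = {}

    def default_partition(self, key):
        # crc32 gives a hash that is stable across processes, unlike the
        # built-in hash(), which is salted per process for strings.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_partitions

    def route(self, key):
        # A re-partitioned key follows its override; every other key
        # follows the default hash assignment.
        return self.overrides.get(key, self.default_partition(key))

    def reassign(self, key, partition):
        # Record that a (for example, skewed) key now lives elsewhere, so
        # later tuples with this key reach the partition holding its state.
        if not 0 <= partition < self.num_partitions:
            raise ValueError("partition out of range")
        self.overrides[key] = partition
```

In this sketch, every tuple is routed through `route`, so a key that has been moved is sent consistently to its new partition. How RCD+ actually selects keys to move, and how it migrates state, is not shown here.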

Highlights

Numerous contemporary applications require continuous querying and analysis, such as microblog analysis in social networks, high-frequency transaction monitoring in the financial field, and real-time recommendation in e-commerce.
A big data stream management system (BDSMS) is used for online analysis and processing of real-time data streams. It is composed of an upper relational query system (RQS) and a lower stream processing system (SPS).
To overcome the partitioning problem of data streams under multiple query tasks, this paper proposes a full granularity partitioning framework, RCD+.

Summary

INTRODUCTION

Numerous contemporary applications require continuous querying and analysis, such as microblog analysis in social networks, high-frequency transaction monitoring in the financial field, and real-time recommendation in e-commerce.

[...] result in a one-second sliding step.

Q3: SELECT roadID, SUM(speed)/COUNT(*) FROM GEOLIFE GROUP BY roadID WINDOW(SLIDING, 10, 1)

For this example, as shown in Figure 1.a, if we build different plans for each query task, we need to replicate multiple data sources.

1) We introduce the mini-batch transmission of data streams based on the sliding window, and we conduct compile-time optimization for multiple query tasks by finding the compatible partitioning key set for the different partitioning keys of each query.
2) We design the full granularity partitioning strategy, which includes runtime positive correlation partitioning (RPC-partitioning) and clustering partitioning (Clu-partitioning). This strategy can ensure load balance at both the inter-node and intra-node levels of SPSs and improve query efficiency.
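As an illustration of what query Q3 above computes, the following is a minimal sketch in plain Python rather than in a stream processing system. It assumes that WINDOW(SLIDING, 10, 1) denotes a 10-second window advanced in one-second steps, and that each record is a (timestamp, roadID, speed) tuple with the timestamp in seconds; the function name `sliding_avg_speed` and the half-open window boundaries are assumptions made for this example.

```python
from collections import defaultdict


def sliding_avg_speed(records, window=10, slide=1):
    """Average speed per road over a sliding window (illustrative only).

    records: iterable of (timestamp_seconds, road_id, speed).
    Returns a list of (window_end, {road_id: average_speed}), one entry
    per slide step, where each window covers [window_end - window, window_end).
    """
    records = sorted(records)
    if not records:
        return []
    first_ts = records[0][0]
    last_ts = records[-1][0]
    results = []
    end = first_ts + slide
    while end <= last_ts + slide:
        start = end - window
        sums = defaultdict(float)
        counts = defaultdict(int)
        for ts, road_id, speed in records:
            if start <= ts < end:
                sums[road_id] += speed   # SUM(speed) per roadID
                counts[road_id] += 1     # COUNT(*) per roadID
        results.append((end, {r: sums[r] / counts[r] for r in sums}))
        end += slide
    return results
```

The sketch rescans all records for every window, which keeps it short but is not how a streaming engine would evaluate the query; it is meant only to show the GROUP BY roadID aggregation that makes roadID the partitioning key for this query.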

PRELIMINARIES

PROBLEM DEFINITION

EVALUATION

RELATED WORK

CONCLUSION


Journal: IEEE Access	Publication Date: Jan 1, 2020
Citations: 2	License type: CC BY 4.0


