Abstract

Big data stream management systems must often transform a query application into multiple query tasks while dynamically partitioning data streams by attribute values or partitioning keys. However, because the query tasks use different partitioning orders or strategies for their partitioning keys, data streams are transmitted redundantly and repeatedly at different nodes, which degrades system performance. In addition, as data skewness changes, data stream partitioning remains unbalanced between different processing units within the same node. This paper presents RCD+, a partitioning framework based on runtime correlation discovery (RCD). RCD+ implements a full granularity partitioning strategy that comprises runtime positive correlation partitioning (RPC-partitioning) and clustering partitioning (Clu-partitioning). First, in RPC-partitioning, we introduce a mini-batch scheme to reduce the number of output stream caches, and we partition data streams using the correlation between partitioning keys. Second, in Clu-partitioning, we re-partition skewed data streams by clustering, both between nodes (inter-node) and within a node (intra-node). We then construct a routing table to manage partition states and ensure the correctness of multiple query tasks. Finally, we implement the framework on Apache Storm. Experiments with synthetic and real data show that the proposed framework exhibits better query performance.

Highlights

  • Numerous contemporary applications require continuous querying and analysis, such as microblog analysis in social networks, high-frequency transaction monitoring in the financial field, and real-time recommendation in e-commerce

  • A big data stream management system (BDSMS) is used for on-line analysis and processing of real-time data streams. It is composed of the upper relational query system (RQS) and the lower stream processing system (SPS)

  • To address the partitioning problem of data streams under multiple query tasks, this paper proposes a full granularity partitioning framework, RCD+


Summary

INTRODUCTION

Numerous contemporary applications require continuous querying and analysis, such as microblog analysis in social networks, high-frequency transaction monitoring in the financial field, and real-time recommendation in e-commerce.

R. Li et al.: RCD+: Partitioning Method for Data Streams Based on Multiple Queries

[...] result in a one-second sliding step.

Q3: SELECT roadID, SUM(speed)/COUNT(*) FROM GEOLIFE GROUP BY roadID WINDOW(SLIDING, 10, 1)

For this example, as shown in Figure 1.a, if we build different plans for each query task, we need to replicate multiple data sources.

1) We introduce the mini-batch transmission of data streams based on the sliding window, and we conduct the compile-time optimization for multiple query tasks by finding the compatible partitioning key set for the different partitioning keys of each query.

2) We design the full granularity partitioning strategy, which includes the runtime positive correlation partitioning (RPC-partitioning) and the clustering partitioning (Clu-partitioning). This strategy can ensure the load balance between the inter-node and the intra-node of SPSs and improve the query efficiency.
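To make the semantics of a query like Q3 concrete, the following is a minimal Python sketch of a sliding-window average grouped by roadID. It is an illustration only, not the RCD+ implementation: the function name `sliding_avg_speed`, the tuple layout `(timestamp, road_id, speed)`, and the reading of `WINDOW(SLIDING, 10, 1)` as a 10-second window with a 1-second step are assumptions made for this example.

```python
from collections import defaultdict


def sliding_avg_speed(tuples, window=10, step=1):
    """Yield (window_end, {road_id: average_speed}) per sliding window.

    Hypothetical helper: `tuples` is an iterable of
    (timestamp_seconds, road_id, speed). Each window covers the
    half-open interval [window_end - window, window_end).
    """
    ordered = sorted(tuples)
    if not ordered:
        return
    first_ts = ordered[0][0]
    last_ts = ordered[-1][0]

    end = first_ts + step
    while end <= last_ts + step:
        start = end - window
        sums = defaultdict(float)   # SUM(speed) per roadID
        counts = defaultdict(int)   # COUNT(*) per roadID
        for ts, road_id, speed in ordered:
            if start <= ts < end:
                sums[road_id] += speed
                counts[road_id] += 1
        # SUM(speed) / COUNT(*) for every roadID seen in this window
        yield end, {road: sums[road] / counts[road] for road in sums}
        end += step


if __name__ == "__main__":
    sample = [(0, "A", 10.0), (0, "B", 20.0), (1, "A", 30.0)]
    for window_end, averages in sliding_avg_speed(sample):
        print(window_end, averages)
```

The sketch rescans all tuples for every window to keep the logic easy to follow. Because the grouping attribute (roadID here) determines which tuples must be aggregated together, it is the natural candidate for a partitioning key when such a query is distributed.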

PRELIMINARIES
PROBLEM DEFINITION
EVALUATION
RELATED WORK
CONCLUSION