PStream: A Popularity-Aware Differentiated Distributed Stream Processing System

Hanhua Chen, Hai Jin, Fan Zhang

doi:10.1109/tc.2020.3019689

Abstract

Real-world stream data with skewed distributions raises unique challenges for distributed stream processing systems. Existing stream workload partitioning schemes usually use a “one size fits all” design, which relies on either a shuffle grouping or a key grouping strategy to partition the stream workload among multiple processing units, leading to unsatisfactory system throughput and processing latency. In this article, we show that key grouping based schemes result in serious load imbalance and low computation efficiency in the presence of data skewness, while shuffle grouping schemes are not scalable in terms of memory space. We argue that the key to efficient stream scheduling is the popularity of the stream data. We propose PStream, a popularity-aware differentiated distributed stream processing system that assigns hot keys using shuffle grouping and rare keys using key grouping. PStream leverages a novel lightweight probabilistic counting scheme for identifying the currently hot keys in dynamic real-time streams. The scheme is extremely efficient in computation and memory consumption, so the predictor based on it can be well integrated into the processing instances in the system. We further design an adaptive threshold configuration scheme, which can quickly adapt to popularity changes in highly dynamic real-time streams. We implement PStream on top of Apache Storm and conduct comprehensive experiments using large-scale traces from real-world systems to evaluate the performance of this design. Results show that PStream achieves a 2.3× improvement in processing throughput and reduces processing latency by 64 percent compared to state-of-the-art designs.
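The differentiated assignment described in the abstract can be illustrated with a minimal Python sketch. It is not the PStream implementation: the class name, the fixed `hot_threshold`, and the exact `Counter` are assumptions made for illustration, standing in for the paper's probabilistic counting scheme and adaptive threshold, which this page does not detail.

```python
import zlib
from collections import Counter
from itertools import count


class DifferentiatedPartitioner:
    """Hypothetical sketch: hot keys go round-robin, rare keys go by hash."""

    def __init__(self, num_instances, hot_threshold):
        self.num_instances = num_instances
        # Assumption: a fixed threshold; the paper adapts it dynamically.
        self.hot_threshold = hot_threshold
        # Assumption: exact counts as a stand-in for the probabilistic predictor.
        self.counts = Counter()
        self._round_robin = count()

    def route(self, key):
        """Return the index of the downstream instance for one tuple."""
        self.counts[key] += 1
        if self.counts[key] > self.hot_threshold:
            # Hot key: shuffle grouping spreads its tuples over all instances.
            return next(self._round_robin) % self.num_instances
        # Rare key: key grouping keeps all its tuples on one instance.
        return zlib.crc32(key.encode("utf-8")) % self.num_instances


def simulate(stream, num_instances=4, hot_threshold=10):
    """Count how many tuples each instance receives for a stream of keys."""
    partitioner = DifferentiatedPartitioner(num_instances, hot_threshold)
    load = [0] * num_instances
    for key in stream:
        load[partitioner.route(key)] += 1
    return load
```

Calling `simulate(["hot"] * 1000 + ["a", "b", "c"])` spreads the single dominant key almost evenly across the four instances, while each rare key stays on the one instance chosen by its hash, so per-key state for rare keys is held in a single place.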

Highlights

Recent advances in distributed stream processing systems such as Storm [1], Heron [2], Spark Streaming [3], S4 [4], and Samza [5] give the community the capability to process huge volumes of unbounded, continuous data streams in real time on clusters [6, 7].
In distributed stream processing systems, to achieve high task parallelism and pipeline parallelism, an application is commonly modeled as a directed acyclic graph.
To further improve system throughput, distributed stream processing systems achieve data parallelism [13] by creating multiple instances of an operator and running them in parallel (see Fig. 1(a)).

Summary

INTRODUCTION

Recent advances in distributed stream processing systems such as Storm [1], Heron [2], Spark Streaming [3], S4 [4], and Samza [5] give the community the capability to process huge volumes of unbounded, continuous data streams in real time on clusters [6, 7]. Key grouping, however, leads to load imbalance because of the skewed distributions found in various real-world datasets [15]. We collect real-world traces from different systems to show that, under key grouping, throughput decreases significantly as the skewness of the stream data increases. It is difficult to meet the rigorous requirements of both computation and memory efficiency needed by a distributed stream processing system. To address this issue, PStream uses a novel lightweight predictor for identifying the currently hot keys in real-time data streams.
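The effect of skew on key grouping can be reproduced with a small simulation. The sketch below is illustrative only and does not use the authors' traces: the synthetic Zipf-like stream, the function names, and the max-over-mean imbalance metric are all assumptions chosen for the example.

```python
import random
import zlib


def zipf_stream(num_keys, num_tuples, exponent, seed=0):
    """Synthetic skewed stream: the key at rank r has weight 1 / r**exponent."""
    rng = random.Random(seed)
    keys = [f"key{i}" for i in range(num_keys)]
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    return rng.choices(keys, weights=weights, k=num_tuples)


def key_grouping_load(stream, num_instances):
    """Key grouping: every tuple of a key goes to the instance chosen by hash."""
    load = [0] * num_instances
    for key in stream:
        load[zlib.crc32(key.encode("utf-8")) % num_instances] += 1
    return load


def shuffle_grouping_load(stream, num_instances):
    """Shuffle grouping: tuples are dealt round-robin regardless of key."""
    load = [0] * num_instances
    for i, _ in enumerate(stream):
        load[i % num_instances] += 1
    return load


def imbalance(load):
    """Maximum load divided by mean load; 1.0 means perfectly balanced."""
    return max(load) / (sum(load) / len(load))


if __name__ == "__main__":
    stream = zipf_stream(num_keys=1000, num_tuples=20000, exponent=1.5)
    print("key grouping imbalance:", imbalance(key_grouping_load(stream, 8)))
    print("shuffle grouping imbalance:", imbalance(shuffle_grouping_load(stream, 8)))
```

With a skewed stream, the instance that receives the most popular key carries far more than the mean load under key grouping, while shuffle grouping stays balanced. The simulation does not model the other side of the trade-off noted in the abstract: under shuffle grouping any instance may see any key, so per-key state can be replicated on every instance, which is why it does not scale in memory.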

RELATED WORK

Design Overview

Hot Keys Predictor

Precision of the Hot Key Prediction

Adaption to Dynamic Popularity Changes

The Stableness of the Synopsis

Adaptive Threshold Configuration

IMPLEMENTATION

Findings

Experiment Setups


Journal: IEEE Transactions on Computers	Publication Date: Aug 26, 2020
Citations: 8	License type: CC BY 4.0


