Abstract

Real-world stream data with skewed distributions poses unique challenges for distributed stream processing systems. Existing stream workload partitioning schemes usually use a “one size fits all” design, which relies on either a shuffle grouping or a key grouping strategy to partition stream workloads among multiple processing units, leading to unsatisfactory system throughput and processing latency. In this article, we show that key grouping based schemes result in serious load imbalance and low computation efficiency in the presence of data skewness, while shuffle grouping schemes do not scale in terms of memory space. We argue that the key to efficient stream scheduling is the popularity of the stream data. We propose PStream, a popularity-aware differentiated distributed stream processing system that assigns hot keys using shuffle grouping and rare keys using key grouping. PStream leverages a novel lightweight probabilistic counting scheme for identifying the currently hot keys in dynamic real-time streams. The scheme is extremely efficient in computation and memory consumption, so the predictor based on it can be well integrated into the processing instances of the system. We further design an adaptive threshold configuration scheme, which can quickly adapt to popularity changes in highly dynamic real-time streams. We implement PStream on top of Apache Storm and conduct comprehensive experiments using large-scale traces from real-world systems to evaluate the performance of this design. Results show that PStream achieves a 2.3× improvement in processing throughput and reduces processing latency by 64 percent compared to state-of-the-art designs.

Highlights

  • The recent advances in distributed stream processing systems such as Storm [1], Heron [2], Spark Streaming [3], S4 [4], and Samza [5], bring the community great capability to process extremely huge volumes of unbounded and continuous data streams in real-time with clusters [6, 7]

  • In distributed stream processing systems, to achieve high task parallelism and pipeline parallelism, an application is commonly modeled as a directed acyclic graph

  • To further improve the system throughput, distributed stream processing systems achieve data parallelism [13] by creating multiple instances for an operator and making them work in parallel (see Fig. 1(a))


Summary

INTRODUCTION

The recent advances in distributed stream processing systems such as Storm [1], Heron [2], Spark Streaming [3], S4 [4], and Samza [5], bring the community great capability to process extremely huge volumes of unbounded and continuous data streams in real-time with clusters [6, 7]. Key grouping leads to load imbalance because of the skewed distributions found in various real-world datasets [15]. We collect real-world traces from different systems to show that, under key grouping, throughput decreases significantly as the skewness of the stream data increases. It is difficult to meet the rigorous requirements of both computation and memory efficiency needed by a distributed stream processing system. To address this issue, in PStream, we design a novel lightweight predictor for identifying the current hot keys in real-time data streams.
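The differentiated partitioning idea can be illustrated with a minimal sketch. It is not PStream's implementation: this summary does not describe the probabilistic counting scheme or the adaptive threshold, so the sketch substitutes an exact per-key counter and a fixed threshold. The class name `PopularityAwarePartitioner`, the `hot_threshold` parameter, the CRC32 hash, and the synthetic stream are all assumptions made for illustration.

```python
import zlib
from collections import Counter
from itertools import cycle


def key_grouping(key, num_instances):
    """Key grouping: a stable hash sends a key to one fixed instance."""
    return zlib.crc32(key.encode("utf-8")) % num_instances


class PopularityAwarePartitioner:
    """Hypothetical router: shuffle grouping for hot keys, key grouping otherwise."""

    def __init__(self, num_instances, hot_threshold):
        self.num_instances = num_instances
        self.hot_threshold = hot_threshold  # assumed fixed, not adaptive
        self.counts = Counter()  # exact counts, a stand-in for the predictor
        self._round_robin = cycle(range(num_instances))

    def route(self, key):
        self.counts[key] += 1
        if self.counts[key] > self.hot_threshold:
            # Hot key: spread its tuples evenly over all instances.
            return next(self._round_robin)
        # Rare key: keep all of its tuples on a single instance.
        return key_grouping(key, self.num_instances)


def imbalance(loads):
    """Load of the busiest instance divided by the mean load (1.0 is ideal)."""
    return max(loads) / (sum(loads) / len(loads))


# Synthetic skewed stream: a single key accounts for 90% of the tuples.
stream = ["hot"] * 9000 + [f"rare-{i}" for i in range(1000)]
n = 4

kg_loads = [0] * n
for key in stream:
    kg_loads[key_grouping(key, n)] += 1

partitioner = PopularityAwarePartitioner(num_instances=n, hot_threshold=100)
pa_loads = [0] * n
for key in stream:
    pa_loads[partitioner.route(key)] += 1

print("key grouping imbalance:", round(imbalance(kg_loads), 2))
print("popularity-aware imbalance:", round(imbalance(pa_loads), 2))
```

On this synthetic stream, key grouping pins every tuple of the hot key to one instance, while the differentiated router spreads them across all four. The exact counter also shows why a lightweight predictor matters: it keeps one entry per distinct key, which is the memory cost a probabilistic scheme is meant to avoid.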

RELATED WORK
Design Overview
Hot Keys Predictor
Precision of the Hot Key Prediction
Adaption to Dynamic Popularity Changes
The Stableness of the Synopsis
Adaptive Threshold Configuration
IMPLEMENTATION
Findings
Experiment Setups