System-aware dynamic partitioning for batch and streaming workloads

Zoltán Zvara, Balázs Barnabás Lóránt, András A Benczúr, Péter G N Szabó

doi:10.1145/3468737.3494087

Abstract

When processing data streams with highly skewed and nonstationary key distributions, we often observe overloaded partitions when hash partitioning fails to balance the data. To avoid slow tasks that delay the completion of a whole stage of computation, it is necessary to apply adaptive, on-the-fly partitioning that continuously recomputes an optimal partitioner from the observed key distribution. While such solutions exist for batch processing of static data sets and for stateless stream processing, the task is difficult for long-running stateful streaming jobs in which the key distribution changes over time. Careful checkpointing and operator state migration are necessary to change the partitioning while the operation is running. Our key result is a lightweight, on-the-fly Dynamic Repartitioning (DR) module for distributed data processing systems (DDPS), including Apache Spark and Flink, which improves performance with negligible overhead. DR adaptively repartitions data during execution using our Key Isolator Partitioner (KIP). In our experiments with real workloads and power-law distributions, we reach speedups between 1.5x and 6x for a variety of Spark and Flink jobs.
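The abstract does not spell out how KIP works, so the following is only a minimal sketch of the general idea of isolating heavy keys, not the authors' algorithm. It assumes that key frequencies have already been observed, places the `top_k` heaviest keys one by one on the currently least-loaded partition, and falls back to plain hash partitioning for all other keys. The names `build_isolating_partitioner`, `hash_partition`, `partition_loads`, and the parameter `top_k` are hypothetical and chosen for illustration.

```python
import zlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Deterministic hash partitioning (CRC32 of the key's string form)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def build_isolating_partitioner(key_counts, num_partitions, top_k):
    """Return a function key -> partition index.

    Assumption for this sketch: the top_k heaviest keys are assigned
    greedily to the least-loaded partition; every other key is hashed.
    The load of the hashed tail is ignored during the greedy step.
    """
    loads = [0] * num_partitions
    explicit = {}
    for key, count in Counter(key_counts).most_common(top_k):
        # Pick the partition with the smallest load so far (first on ties).
        target = min(range(num_partitions), key=loads.__getitem__)
        explicit[key] = target
        loads[target] += count

    def partition(key):
        if key in explicit:
            return explicit[key]
        return hash_partition(key, num_partitions)

    return partition


def partition_loads(key_counts, partitioner, num_partitions):
    """Total record count per partition under the given partitioner."""
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        loads[partitioner(key)] += count
    return loads


if __name__ == "__main__":
    # Synthetic skewed histogram: three heavy keys plus a light tail.
    counts = {"a": 1000, "b": 900, "c": 800}
    counts.update({f"tail{i}": 1 for i in range(50)})

    isolating = build_isolating_partitioner(counts, num_partitions=4, top_k=3)
    print("isolating:", partition_loads(counts, isolating, 4))
    print("hash only:", partition_loads(
        counts, lambda k: hash_partition(k, 4), 4))
```

In a streaming setting the histogram would be refreshed periodically and the partitioner rebuilt from it; switching partitioners for stateful operators additionally requires the checkpointing and state migration that the abstract mentions, which this sketch leaves out.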
