Abstract

Distributed stream processing frameworks have gained widespread adoption in the last decade because they abstract away the complexity of parallel processing. One of their key features is built-in fault tolerance. In this work, we dive deeper into the implementation, performance, and efficiency of this critical feature for four state-of-the-art frameworks. We include the established Spark Streaming and Flink frameworks and the more novel Spark Structured Streaming and Kafka Streams frameworks. We test the behavior under different types of faults and settings: master failure with and without high-availability setups, driver failures for the Spark frameworks, worker failure with or without exactly-once semantics, and application and task failures. We highlight differences in behavior during these failures across several aspects, e.g., whether there is an outage, downtime, recovery time, data loss, duplicate processing, accuracy, and the cost and behavior of different message delivery guarantees. Our results highlight the impact of framework design on the speed of fault recovery and explain how different use cases may benefit from different approaches. Due to their task-based scheduling approach, the Spark frameworks can recover within 30 seconds, and in most cases without necessitating an application restart. Kafka Streams has only a few seconds of downtime, but is slower at catching up on delays. Finally, Flink can offer end-to-end exactly-once semantics at a low cost but requires job restarts for most failures, leading to high recovery times of around 50 seconds.

Highlights

  • The demand for near real-time processing has been soaring in the last decade with the rise of the IoT domain and a surge in time-sensitive use cases such as fraud detection and monitoring

  • We can conclude that high-availability setups are crucial to ensure the continuation and restart of processing applications in both Flink and Spark

  • Spark Streaming and Structured Streaming do not notice any impact of a master failure

Summary

Introduction

The demand for near real-time processing has been soaring in the last decade with the rise of the IoT domain and a surge in time-sensitive use cases such as fraud detection and monitoring. Often, this requires processing large volumes of data for which a single machine does not suffice or becomes increasingly expensive. Many benchmarks have been developed to study performance differences between these frameworks. Their distributed architecture implies that several components can fail. Both Spark frameworks and Flink use a master for job scheduling. We experiment with enabling exactly-once semantics to determine the performance impact.
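To illustrate the exactly-once setting the paper experiments with, the fragment below sketches how checkpoint-based exactly-once processing is typically enabled in a Flink job. The checkpoint interval and pause values are illustrative assumptions, not the paper's actual configuration:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 s (illustrative interval). EXACTLY_ONCE
        // aligns checkpoint barriers so that, after recovery from a failure,
        // each record is reflected in the state exactly once.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Leave at least 500 ms between checkpoints so checkpointing does not
        // starve regular processing (illustrative value).
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);

        // ... sources, transformations, sinks, and env.execute() would follow here.
    }
}
```

End-to-end exactly-once additionally requires transactional or idempotent sinks; checkpointing alone only guarantees exactly-once effects on Flink's internal state.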

