Abstract
Distributed stream processing frameworks have gained widespread adoption in the last decade because they abstract away the complexity of parallel processing. One of their key features is built-in fault tolerance. In this work, we dive deeper into the implementation, performance, and efficiency of this critical feature for four state-of-the-art frameworks. We include the established Spark Streaming and Flink frameworks and the more novel Spark Structured Streaming and Kafka Streams frameworks. We test the behavior under different types of faults and settings: master failure with and without high-availability setups, driver failures for the Spark frameworks, worker failure with or without exactly-once semantics, and application and task failures. We highlight differences in behavior during these failures across several aspects, e.g., whether there is an outage, downtime, recovery time, data loss, duplicate processing, accuracy, and the cost and behavior of different message delivery guarantees. Our results highlight the impact of framework design on the speed of fault recovery and explain how different use cases may benefit from different approaches. Due to their task-based scheduling approach, the Spark frameworks can recover within 30 seconds, in most cases without requiring an application restart. Kafka Streams has only a few seconds of downtime but is slower at catching up on accumulated delays. Finally, Flink can offer end-to-end exactly-once semantics at a low cost but requires job restarts for most failures, leading to high recovery times of around 50 seconds.
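To make Flink's exactly-once guarantee concrete, the sketch below shows how a job enables checkpoint-based exactly-once processing. This is a minimal illustration, not code from the paper: the class name, the 10-second checkpoint interval, and the placeholder pipeline are our own assumptions.

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ExactlyOnceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 10 s with exactly-once barrier semantics.
            // The interval is an illustrative value, not one from the paper.
            env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

            // Placeholder pipeline; end-to-end exactly-once additionally
            // requires transactional sources and sinks (e.g., Kafka).
            env.fromElements(1, 2, 3).print();

            env.execute("exactly-once-sketch");
        }
    }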
Highlights
The demand for near real-time processing has been soaring in the last decade with the rise of the IoT domain and a surge in time-sensitive use cases such as fraud detection and monitoring.
We can conclude that high-availability setups are crucial to ensure the continuation and restart of processing applications in both Flink and Spark (see the configuration sketch after these highlights).
Spark Streaming and Structured Streaming show no noticeable impact from a master failure.
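As one concrete illustration of such a high-availability setup, the sketch below enables ZooKeeper-based JobManager failover in Flink, so that a standby master can take over when the active one fails. In a real cluster these options would normally be set in flink-conf.yaml; the class name, ZooKeeper quorum, and storage path are placeholders of our own, not values from the paper, and passing a Configuration to getExecutionEnvironment requires Flink 1.12 or later.

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.HighAvailabilityOptions;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HighAvailabilitySketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // ZooKeeper-based leader election lets a standby JobManager take
            // over after a master failure; quorum and path are placeholders.
            conf.set(HighAvailabilityOptions.HA_MODE, "zookeeper");
            conf.set(HighAvailabilityOptions.HA_ZOOKEEPER_QUORUM,
                    "zk1:2181,zk2:2181,zk3:2181");
            conf.set(HighAvailabilityOptions.HA_STORAGE_PATH, "hdfs:///flink/ha");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            env.fromElements(1, 2, 3).print();
            env.execute("ha-sketch");
        }
    }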
Summary
The demand for near real-time processing has been soaring in the last decade with the rise of the IoT domain and a surge in time-sensitive use cases such as fraud detection and monitoring. Often, this requires processing volumes of data for which a single machine does not suffice or becomes increasingly expensive. Many benchmarks have been developed to study performance differences between these frameworks. Their distributed architecture implies that several components can fail. Both Spark frameworks and Flink use a master for job scheduling. We experiment with enabling exactly-once semantics to determine its performance impact.
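As an example of what enabling exactly-once semantics looks like in practice, the sketch below shows the single configuration switch in Kafka Streams. It is a minimal illustration of our own: the application id, broker address, and topic names are placeholders, and it assumes a client version (Kafka 2.8+) that supports EXACTLY_ONCE_V2; older versions use the EXACTLY_ONCE constant instead.

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class ProcessingGuaranteeSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "eos-sketch");     // hypothetical id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder broker
            // The default guarantee is at-least-once; this single setting
            // switches the application to transactional exactly-once processing.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG,
                    StreamsConfig.EXACTLY_ONCE_V2);

            StreamsBuilder builder = new StreamsBuilder();
            builder.stream("input-topic").to("output-topic"); // trivial pass-through topology

            new KafkaStreams(builder.build(), props).start();
        }
    }

Because the guarantee is a single configuration value, it can be toggled on and off between runs, which makes it straightforward to measure the performance cost of exactly-once delivery as the experiments above do.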