Abstract

Today, large-scale cloud organizations are deploying datacenters and “edge” clusters globally to provide their users with low-latency access to their services. Running stream applications across these geo-distributed sites is emerging as a daily requirement, such as making business decisions from marketing streams, identifying spam campaigns from social network streams, and analyzing genomes in different labs and countries to track the sources of a potential epidemic. However, while progress has been encouraging, existing efforts have centered predominantly on stateless stream processing, leaving another urgent trend, stateful stream processing, much less explored. Next-generation stream processing systems must store and update state during processing and, most importantly, recover large distributed states when faults and failures occur. Existing studies exhibit major limitations: (1) they mostly inherit MapReduce's “single master/many workers” architecture, in which the central master is responsible for all scheduling activities and easily becomes a scalability bottleneck; (2) they offer state recovery mainly through three approaches, replication recovery, checkpointing recovery, and DStream-based lineage recovery, which are slow, resource-expensive, or unable to handle multiple failures; and (3) they are not adaptive to heterogeneous hardware settings in the cloud. In this paper, we present A-FP4S, a novel adaptive fragments-based parallel state recovery mechanism for stream processing systems that manages and recovers large distributed states for a massive number of stream applications.
The novelty of A-FP4S is that we organize stream operators into a distributed hash table (DHT) based peer-to-peer (P2P) overlay. We then divide each node's local state into many fragments and periodically store them on each node's multiple neighbors (the leaf-set nodes of the DHT), ensuring that different sets of available fragments can reconstruct failed states in parallel. As a result, the recovery mechanism scales with the size of the lost state, significantly reduces failure recovery time, and tolerates multiple node failures. A-FP4S adapts to heterogeneous hardware settings (e.g., CPU speed, disk/file-system speed, network bandwidth) through automatic parameter tuning over phases. Compared to Apache Storm, A-FP4S achieves a significant 31.8% to 50.5% reduction in recovery latency. It scales to many simultaneous failures and successfully recovers the state even when more than half of the operators fail or are lost. Large-scale experiments using real-world datasets demonstrate A-FP4S's attractive scalability and adaptivity properties.
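To make the fragments-based idea concrete, the sketch below illustrates the general pattern the abstract describes: serialized operator state is split into fragments, each fragment is replicated on a few leaf-set neighbors, and recovery reassembles the state from whatever fragments survive. This is a minimal illustration only; the node names, the fragment count, the simple two-copy placement policy, and the function names are assumptions for exposition, not A-FP4S's actual API or placement algorithm.

```python
# Illustrative sketch of fragment-based state checkpointing and recovery.
# All names and policies here are hypothetical, not the paper's implementation.

def fragment_state(state: bytes, num_fragments: int) -> list:
    """Split a serialized operator state into roughly equal fragments."""
    size = -(-len(state) // num_fragments)  # ceiling division
    return [state[i * size:(i + 1) * size] for i in range(num_fragments)]

def distribute(fragments, leaf_set):
    """Place each fragment on two distinct leaf-set neighbors (round-robin),
    so losing any single neighbor loses no fragment."""
    placement = {node: [] for node in leaf_set}
    for idx, frag in enumerate(fragments):
        for r in range(2):  # two replicas per fragment
            node = leaf_set[(idx + r) % len(leaf_set)]
            placement[node].append((idx, frag))
    return placement

def recover(placement, alive, num_fragments):
    """Reassemble the state from fragments held by surviving neighbors.
    In a real system each neighbor would be contacted in parallel."""
    found = {}
    for node in alive:
        for idx, frag in placement.get(node, []):
            found.setdefault(idx, frag)
    missing = [i for i in range(num_fragments) if i not in found]
    if missing:
        raise RuntimeError(f"unrecoverable: fragments {missing} lost")
    return b"".join(found[i] for i in range(num_fragments))
```

With five leaf-set neighbors and two replicas per fragment, the state survives any single neighbor failure, since each fragment lives on two distinct nodes; tolerating more simultaneous failures requires more replicas or erasure coding.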
