On Fault Tolerance for Distributed Iterative Dataflow Processing

Chen Xu, Markus Holzemer, Juan Soto, Manohar Kaul, Volker Markl

doi:10.1109/tkde.2017.2690431

Abstract

Large-scale graph and machine learning analytics widely employ distributed iterative processing. Typically, these analytics are part of a comprehensive workflow, which includes data preparation, model building, and model evaluation. General-purpose distributed dataflow frameworks execute all steps of such workflows holistically. This holistic view enables these systems to reason about and automatically optimize the entire pipeline. Here, graph and machine learning analytics are known to incur a long runtime since they require multiple passes over the data until convergence is reached. Thus, fault tolerance and fast recovery from intermittent failures are critical for efficient analysis. In this paper, we propose novel fault-tolerance mechanisms for graph and machine learning analytics that run on distributed dataflow systems. We seek to reduce checkpointing costs and shorten failure recovery times. For graph processing, rather than writing checkpoints that block downstream operators, our mechanism writes checkpoints in an unblocking manner that does not break pipelined tasks. In contrast to conventional approaches to unblocking checkpointing (e.g., those that manage checkpoints independently for immutable datasets), we inject the checkpoints of mutable datasets into the iterative dataflow itself. Hence, our mechanism is iteration-aware by design. This simplifies the system architecture and facilitates coordinating checkpoint creation during iterative graph processing. Moreover, we are able to recover rapidly via confined recovery, which exploits the log files that exist locally on healthy nodes and thereby avoids a complete recomputation from scratch. In addition, we propose replica recovery for machine learning algorithms, whereby we employ a broadcast variable that enables us to recover quickly without having to introduce any checkpoints. To evaluate our fault tolerance strategies, we conduct both a theoretical study and experimental analyses using Apache Flink and find that they outperform blocking checkpointing and complete recovery.
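The abstract gives no implementation details, so the following is only a toy illustration of the general idea behind confined recovery, not the paper's Flink mechanism. It assumes an all-to-all message pattern, a made-up update rule (`step`), a checkpoint taken at iteration 0, and hypothetical names throughout (`Worker`, `run`, `confined_recovery`). Each worker logs the messages it sends; when one worker fails, only that worker replays from its checkpoint, reading the logs kept by the healthy workers, which do not recompute anything.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def step(own: float, incoming: List[float]) -> float:
    """Hypothetical update rule: blend own value with the mean of incoming messages."""
    return 0.5 * own + 0.5 * sum(incoming) / len(incoming)


@dataclass
class Worker:
    wid: int
    value: float
    checkpoint: Tuple[int, float] = (0, 0.0)                  # (iteration, saved state)
    sent_log: Dict[int, float] = field(default_factory=dict)  # local log of sent messages


def make_workers(values: List[float]) -> List[Worker]:
    # Assumption: every worker checkpoints its initial state at iteration 0.
    return [Worker(wid=i, value=v, checkpoint=(0, v)) for i, v in enumerate(values)]


def run(workers: List[Worker], start: int, end: int) -> None:
    """Failure-free execution of iterations start..end-1."""
    for it in range(start, end):
        for w in workers:
            w.sent_log[it] = w.value  # log the outgoing message locally
        for w in workers:
            incoming = [o.sent_log[it] for o in workers if o.wid != w.wid]
            w.value = step(w.value, incoming)


def confined_recovery(workers: List[Worker], failed_id: int, current_iter: int) -> int:
    """Replay only the failed worker, using logs held by the healthy workers.

    Returns the number of iterations that were replayed.
    """
    failed = workers[failed_id]  # relies on wid == list index (see make_workers)
    ckpt_iter, value = failed.checkpoint
    failed.sent_log = {}
    for it in range(ckpt_iter, current_iter):
        failed.sent_log[it] = value  # rebuild the failed worker's own log
        incoming = [o.sent_log[it] for o in workers if o.wid != failed_id]
        value = step(value, incoming)
    failed.value = value
    return current_iter - ckpt_iter


# Usage: run five iterations, lose worker 1, then recover it in isolation.
workers = make_workers([1.0, 2.0, 4.0])
run(workers, 0, 5)
expected = [w.value for w in workers]

workers[1].value = float("nan")  # simulate the loss of worker 1's state
workers[1].sent_log = {}
replayed = confined_recovery(workers, failed_id=1, current_iter=5)
recovered = [w.value for w in workers]
```

In this sketch the healthy workers keep their current state untouched, whereas a complete recovery would roll every worker back to the checkpoint and recompute all of them.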

Journal: IEEE Transactions on Knowledge and Data Engineering
Publication Date: Aug 1, 2017
Citations: 12

Similar Papers

Efficient fault-tolerance for iterative graph processing on distributed dataflow systems
Chen Xu ... Volker Markl
01 May 2016

AsynGraph
Yu Zhang ... Haikun Liu
ACM Transactions on Architecture and Code Optimization | VOL. 17
30 Sep 2020

Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments
Dawei Sun ... Xingwei Wang
The Journal of Supercomputing | VOL. 66
21 Mar 2013

CSHFt: A Composite Fault-Tolerant Architecture and Self-Adaptable Hierarchical Fault-Tolerant Strategy for Satellite System
Hao Zhou ... Jingfei Jiang
01 Oct 2011
