Checkpoint Selection in Fault Recovery Based on Byzantine Fault Model

Xinhai Xu, Yufei Lin

doi:10.1109/cicn.2012.59

Abstract

As supercomputer performance grows, reliability becomes an increasingly serious problem. To complete an application with low fault-recovery overhead, Checkpoint/Restart (C/R) methods are widely used. Mainstream C/R methods to date either assume the Fail-Stop fault model or have the system (or program) perform error detection before storing each checkpoint, so they can guarantee that every checkpoint is correct. However, faults in real-world systems are more consistent with the Byzantine fault model, and, in pursuit of higher practical performance, neither the system nor the program implements any fault detection mechanism. Consequently, checkpoints may contain errors. This paper studies the checkpoint selection problem under the Byzantine fault model: which checkpoint should be chosen as the rollback target after a system failure. We design a checkpoint selection framework and, based on it, propose three checkpoint selection strategies: a conservative strategy, an aggressive strategy, and a statistical strategy. Simulation results show that the conservative strategy is superior when the error latent period is long, whereas the aggressive strategy shows the opposite behavior. The statistical strategy has stable efficiency, incurring only 50% more overhead than ideal checkpoint selection when the checkpoint period is half of the mean time between faults.
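The abstract names the strategies but does not define them, so the following Python sketch only illustrates the selection problem under assumptions of our own. It assumes that a silent error at `error_time` corrupts every checkpoint taken after it, that the failure is observed later at `failure_time`, that "aggressive" means trying the newest checkpoint first and stepping back one checkpoint per failed restart, and that "conservative" means returning directly to the oldest (known-clean) checkpoint. All function names are hypothetical, and the paper's actual strategies may differ.

```python
from bisect import bisect_right


def latest_clean_index(checkpoints, error_time):
    """Index of the newest checkpoint taken at or before the error.

    This is the ideal choice. It needs error_time, which a system
    without fault detection does not know.
    Assumes checkpoints is sorted and checkpoints[0] <= error_time.
    """
    return bisect_right(checkpoints, error_time) - 1


def aggressive(checkpoints, error_time):
    """Assumed policy: try the newest checkpoint, then step back one
    checkpoint per failed restart until a clean one is reached.

    Returns (chosen index, number of restarts performed).
    """
    clean = latest_clean_index(checkpoints, error_time)
    newest = len(checkpoints) - 1
    return clean, newest - clean + 1


def conservative(checkpoints, error_time):
    """Assumed policy: go straight back to the oldest checkpoint, which
    is taken to be clean. error_time is unused; it is kept only so both
    policies share one signature.

    Returns (chosen index, number of restarts performed).
    """
    return 0, 1


def lost_work(checkpoints, index, failure_time):
    """Execution time that must be redone after rolling back to index."""
    return failure_time - checkpoints[index]


# Illustrative values only: checkpoint period 10, error at t=25,
# failure observed at t=45 (error latent period of 20).
checkpoints = [0, 10, 20, 30, 40]
error_time, failure_time = 25, 45

for name, policy in (("aggressive", aggressive), ("conservative", conservative)):
    index, restarts = policy(checkpoints, error_time)
    redo = lost_work(checkpoints, index, failure_time)
    print(f"{name}: checkpoint t={checkpoints[index]}, restarts={restarts}, redo={redo}")
```

Under these assumptions, a longer latent period leaves more corrupted checkpoints for the aggressive policy to step through, while the conservative policy always needs a single restart but redoes more work. This is in line with the trend the abstract reports, though the sketch is not a reproduction of the paper's simulation.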

