An enhanced model-based checkpointing protocol for preventing useless checkpoints

Jiang Wu, D Manivannan

doi:10.1080/17445760802615688

Abstract

Checkpointing and rollback recovery are widely used techniques for handling failures in distributed computing systems. If processes do not coordinate during checkpointing, they may take useless checkpoints, that is, checkpoints that cannot be part of any consistent global checkpoint. In this paper, we propose a communication-induced checkpointing algorithm that prevents useless checkpoints by directing processes to take forced checkpoints more efficiently whenever a communication pattern that may lead to a Z-cycle (ZC) is observed. The existence of a ZC involving a checkpoint is known to be necessary and sufficient for that checkpoint to be useless. The basic idea behind our algorithm can be extended to existing model-based checkpointing algorithms to reduce the number of forced checkpoints. We also compare the performance of our algorithm with that of a well-known existing algorithm.
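
To make the Z-cycle condition concrete, the following is a minimal offline sketch of checking whether a given checkpoint lies on a Z-cycle. It is not the protocol proposed in this paper. It assumes the standard zigzag-path definition and a full record of the message pattern. The names `Message` and `on_z_cycle` and the interval numbering are assumptions made for illustration: interval k of a process is the span between its checkpoint k and its checkpoint k+1, with checkpoint 0 being the initial one.

```python
from collections import deque
from typing import NamedTuple


class Message(NamedTuple):
    """One message, tagged with the checkpoint intervals of its send and receive events."""
    sender: int
    send_interval: int
    receiver: int
    recv_interval: int


def on_z_cycle(messages, proc, ckpt):
    """Return True if checkpoint `ckpt` of process `proc` lies on a Z-cycle.

    A Z-cycle is a message chain m1..mn such that:
      - m1 is sent by `proc` after the checkpoint (send_interval >= ckpt),
      - each next message is sent by the previous receiver in the same or a
        later interval than the one in which the previous message was received,
      - mn is received by `proc` before the checkpoint (recv_interval < ckpt).
    """
    start = [m for m in messages if m.sender == proc and m.send_interval >= ckpt]
    queue = deque(start)
    seen = set(start)
    while queue:
        m = queue.popleft()
        if m.receiver == proc and m.recv_interval < ckpt:
            return True
        for nxt in messages:
            if (nxt not in seen
                    and nxt.sender == m.receiver
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt)
                queue.append(nxt)
    return False


# Process 1 sends m1 after its checkpoint 1; process 0 receives m1 in the same
# interval in which it earlier sent m2, and m2 reaches process 1 before checkpoint 1.
pattern = [Message(1, 1, 0, 0), Message(0, 0, 1, 0)]

# Same pattern, except process 0 takes a forced checkpoint before receiving m1,
# so m1 is now received in interval 1 and the chain m1 -> m2 is broken.
pattern_forced = [Message(1, 1, 0, 1), Message(0, 0, 1, 0)]
```

In this sketch, `on_z_cycle(pattern, 1, 1)` reports a Z-cycle, while `on_z_cycle(pattern_forced, 1, 1)` does not. This illustrates how a forced checkpoint taken before a message is delivered can break the pattern. A communication-induced protocol has to make that decision online from piggybacked information rather than from a complete message log.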

Full Text