Self-stabilizing algorithm for checkpointing in a distributed system

Partha Sarathi Mandal, Krishnendu Mukhopadhyaya

doi:10.1016/j.jpdc.2007.02.006

Abstract

If the variables used by a checkpointing algorithm suffer data faults, existing checkpointing and recovery algorithms may fail. This paper proposes self-stabilizing algorithms for data fault detection and correction, checkpointing, and recovery on a ring topology. The proposed data fault detection and correction algorithms can handle at most one data fault per process, in any number of processes. The proposed checkpointing algorithm can handle multiple concurrent initiations of checkpointing as well as data faults. Using the proposed recovery algorithm, a process can recover from a fault even when multiple data faults are present in the system. All the proposed algorithms converge in O(n) steps, where n is the number of processes. The algorithms can also be extended to work for general topologies.
