Concurrent checkpoint initiation and recovery algorithms on asynchronous ring networks

Partha Sarathi Mandal, Krishnendu Mukhopadhyaya

doi:10.1016/j.jpdc.2004.03.013

Abstract

Checkpointing with rollback recovery is a well-known method for achieving fault tolerance in distributed systems. In this work, we introduce algorithms for checkpointing and rollback recovery on asynchronous unidirectional and bi-directional ring networks. The proposed checkpointing algorithms can handle multiple concurrent initiations by different processes. While taking checkpoints, processes need not consider application message dependencies; synchronization is achieved by passing control messages among the processes. Application messages are acknowledged, and each process maintains a list of unacknowledged messages. We use a logical checkpoint, which is a standard checkpoint (i.e., a snapshot of the process) plus a list of the messages that this process has sent but that are unacknowledged at the time the checkpoint is taken. The worst-case message complexity of the proposed checkpointing algorithm is O(kn) when k initiators initiate concurrently, and its time complexity is O(n). For the recovery algorithm, the time and message complexities are both O(n).
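
The logical checkpoint described above can be illustrated with a minimal sketch. It is not the paper's algorithm: it covers only the data structure (process snapshot plus the list of sent-but-unacknowledged messages) and omits the ring topology and control messages entirely. The names `Process`, `LogicalCheckpoint`, `send`, `on_ack`, and `take_logical_checkpoint` are hypothetical and chosen for illustration.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class LogicalCheckpoint:
    """Hypothetical container: standard checkpoint plus unacknowledged sends."""
    snapshot: dict   # deep copy of the process state (the standard checkpoint)
    unacked: list    # (msg_id, dest, payload) for each unacknowledged message


@dataclass
class Process:
    """Illustrative process; the field names are assumptions, not the paper's."""
    pid: int
    state: dict = field(default_factory=dict)
    unacked: dict = field(default_factory=dict)  # msg_id -> (dest, payload)
    next_id: int = 0

    def send(self, dest, payload):
        # Record every application message until its acknowledgement arrives.
        msg_id = self.next_id
        self.next_id += 1
        self.unacked[msg_id] = (dest, payload)
        return msg_id

    def on_ack(self, msg_id):
        # An acknowledged message no longer belongs in a logical checkpoint.
        self.unacked.pop(msg_id, None)

    def take_logical_checkpoint(self):
        # Snapshot the state and freeze the current unacknowledged list.
        pending = [(m, d, p) for m, (d, p) in sorted(self.unacked.items())]
        return LogicalCheckpoint(copy.deepcopy(self.state), pending)


p = Process(pid=0, state={"counter": 7})
first = p.send(dest=1, payload="x")
second = p.send(dest=1, payload="y")
p.on_ack(first)                      # only the second message is still pending
ckpt = p.take_logical_checkpoint()
p.state["counter"] = 8               # later changes do not affect the checkpoint
```

In this sketch the checkpoint holds the state as it was when taken, together with the single message that had not yet been acknowledged. How the recovery algorithm uses that list is not stated in the abstract, so it is left out here.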
