Fault tolerant distributed computing using atomic send-receive checkpoints

Z.M. Wojcik, B.E. Wojcik

doi:10.1109/spdp.1990.143536

Abstract

The paper presents a deadlock-free fault recovery algorithm for a fully distributed system in which messages need not arrive in the order in which they were sent. The method is based on asynchronous, atomic checkpointing of the sender and receiver of a message. Messages not balanced in the last permanent checkpoints are recorded in the new checkpoints. Fault recovery is based on (a) repeating all lost messages according to a record of unbalanced messages in the last permanent checkpoints, and (b) undoing every message re-sent during fault recovery, or undoing a computation repeated according to a record of unbalanced messages in the last permanent checkpoints. A fault recovery involves only processes that communicated before a failure. A distributed computation may be split into a few segments without affecting transaction consistency. The algorithm involves the minimum number of messages. A proof of the resilience of the fault recovery algorithm is presented.
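
The recovery idea outlined in the abstract can be illustrated with a small, single-threaded Python sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It is a toy model under several assumptions: all names (`Process`, `Checkpoint`, `atomic_checkpoint`, `recover`) are hypothetical, process state is a single integer, checkpoints live in memory, and "undoing" a re-sent message is approximated by discarding a duplicate at the receiver.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical permanent checkpoint of one process."""
    state: int = 0
    next_id: int = 0
    unbalanced: dict = field(default_factory=dict)  # msg_id -> (dest name, payload)
    received: frozenset = frozenset()


class Process:
    def __init__(self, name):
        self.name = name
        self.state = 0
        self.next_id = 0
        self.unbalanced = {}   # sent messages not yet balanced by a joint checkpoint
        self.received = set()  # IDs of messages already applied
        self.saved = Checkpoint()

    def send(self, dest, payload):
        msg_id = (self.name, self.next_id)
        self.next_id += 1
        self.unbalanced[msg_id] = (dest.name, payload)
        return msg_id

    def deliver(self, msg_id, payload):
        # Assumption: a duplicate from a re-send is discarded, standing in for "undo".
        if msg_id in self.received:
            return False
        self.received.add(msg_id)
        self.state += payload
        return True

    def checkpoint(self):
        # Unbalanced messages are carried into the new checkpoint.
        self.saved = Checkpoint(self.state, self.next_id,
                                dict(self.unbalanced), frozenset(self.received))

    def restore(self):
        self.state = self.saved.state
        self.next_id = self.saved.next_id
        self.unbalanced = dict(self.saved.unbalanced)
        self.received = set(self.saved.received)


def atomic_checkpoint(sender, receiver, msg_id):
    """Checkpoint both ends together; a received message becomes balanced."""
    if msg_id in receiver.received:
        sender.unbalanced.pop(msg_id, None)
    sender.checkpoint()
    receiver.checkpoint()


def recover(sender, receiver):
    """Roll both processes back, then re-send what the record marks unbalanced."""
    sender.restore()
    receiver.restore()
    applied = []
    for msg_id, (dest, payload) in sorted(sender.unbalanced.items()):
        if dest == receiver.name and receiver.deliver(msg_id, payload):
            applied.append(msg_id)
    return applied


if __name__ == "__main__":
    a, b = Process("A"), Process("B")
    m0 = a.send(b, 5)
    b.deliver(m0, 5)
    atomic_checkpoint(a, b, m0)   # m0 is now balanced
    m1 = a.send(b, 7)
    a.checkpoint()                # m1 is recorded as unbalanced
    b.deliver(m1, 7)              # B fails before checkpointing this receipt
    print(recover(a, b), b.state)
```

In this sketch only the two processes that exchanged the message take part in recovery, mirroring the abstract's claim that recovery involves only processes that communicated before the failure. A real implementation would also need stable storage, concurrency control, and handling of sender failures, none of which is modelled here.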
