The problem of recovering from transient processor failures in distributed computing systems using optimistic message logging is considered, and two crash recovery algorithms are presented. Both algorithms have three advantages: (1) no nonfaulty processor is forced to roll back, (2) output messages may be committed immediately, and (3) a nonfaulty processor is not disrupted if none of its neighbors is faulty. The first algorithm uses acknowledgments, permits at most two neighboring processors to fail at the same time, and does not block the receiver or the sender of each message. The second algorithm copes with the simultaneous failure of any number of nodes, but it requires that the sender id of each message (not its content) be logged to stable storage before the message is processed. Though the second algorithm blocks the receiver, the blocking time can be substantially reduced. In both cases, the state of a faulty processor immediately before its failure is recreated. Thus, after either of the two crash recovery algorithms is run, the distributed system restarts from a consistent state.
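The logging requirement of the second algorithm can be illustrated with a minimal Python sketch. It is not the paper's algorithm, only an illustration of the rule that a sender id is written to stable storage before the message is processed. The class names `SenderIdLog` and `Receiver`, the file-based log, and the replay step are all hypothetical. The replay further assumes that message processing is deterministic and that each sender can retransmit its messages in their original order, neither of which is stated in the abstract.

```python
import os
import tempfile
from collections import deque


class SenderIdLog:
    """Hypothetical append-only log of sender ids kept on disk."""

    def __init__(self, path):
        self.path = path

    def append(self, sender_id):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{sender_id}\n")
            f.flush()
            os.fsync(f.fileno())  # the receiver blocks here until the id is durable

    def read_all(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]


class Receiver:
    def __init__(self, log):
        self.log = log
        self.state = []  # volatile state, lost on a crash

    def process(self, sender_id, payload):
        # Stand-in for deterministic application logic.
        self.state.append((sender_id, payload))

    def deliver(self, sender_id, payload):
        self.log.append(sender_id)  # log the sender id (not the content) first
        self.process(sender_id, payload)

    def recover(self, resend_queues):
        # Assumption: resend_queues maps each sender id to a deque of its
        # payloads in original send order, as retransmitted by that sender.
        self.state = []
        for sender_id in self.log.read_all():
            self.process(sender_id, resend_queues[sender_id].popleft())


def demo():
    with tempfile.TemporaryDirectory() as d:
        log = SenderIdLog(os.path.join(d, "senders.log"))
        receiver = Receiver(log)
        for sender_id, payload in [("p1", "a"), ("p2", "x"), ("p1", "b")]:
            receiver.deliver(sender_id, payload)
        before = list(receiver.state)

        # Simulated crash: a fresh receiver has no volatile state, only the log.
        recovered = Receiver(log)
        recovered.recover({"p1": deque(["a", "b"]), "p2": deque(["x"])})
        return before, recovered.state
```

Calling `demo()` returns the state before the simulated crash and the state rebuilt from the log. In this sketch the two match because the logged sender order fixes the interleaving of messages from different senders. The `os.fsync` call is where the receiver blocks, which is the cost the abstract says can be substantially reduced.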