The paper proposes a new technique for providing software fault tolerance in concurrent systems. It combines the traditional global checkpointing mechanism with the recovery block concept to yield an easily implementable error recovery mechanism. When process interaction is moderate to high, this mechanism incurs lower overhead than the schemes considered in the past, which are based on local checkpointing. A model for computing the optimum checkpointing interval is also presented. A particular distribution is hypothesized for the coverage of the recovery, and the behavior of the model is studied in detail for this case.