A case for two-level distributed recovery schemes

Nitin H Vaidya

doi:10.1145/223587.223596

Publication Date: May 1, 1995

Citations: 64

Affiliation: Texas A&M University

Abstract

Most distributed and multiprocessor recovery schemes proposed in the literature are designed to tolerate an arbitrary number of failures. In this paper, we demonstrate that it is often advantageous to use "two-level" recovery schemes. A two-level recovery scheme tolerates the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead. By minimizing the overhead for the more frequently occurring failure scenarios, our approach is expected to achieve lower performance overhead (on average) than existing recovery schemes.

To demonstrate the advantages of two-level recovery, we evaluate the performance of a recovery scheme that takes two different types of checkpoints, namely, 1-checkpoints and N-checkpoints. A single failure can be tolerated by rolling the system back to a 1-checkpoint, while recovery from multiple failures is possible by rolling back to an N-checkpoint. For such a system, we demonstrate that to minimize the average overhead, it is often necessary to take both 1-checkpoints and N-checkpoints.

While the conclusions of this paper are intuitive, work on the design of appropriate recovery schemes is lacking. The objective of this paper is to motivate research into recovery schemes that can provide multiple levels of fault tolerance.
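
The rollback rule described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the names `Checkpoint`, `CheckpointLog`, `take`, and `rollback_target` are hypothetical, as are the logical timestamps. The sketch also assumes that an N-checkpoint is usable for single-failure recovery, which the abstract does not state explicitly.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Checkpoint:
    time: int   # logical time at which the checkpoint was taken (hypothetical unit)
    level: str  # "1" for a 1-checkpoint, "N" for an N-checkpoint


@dataclass
class CheckpointLog:
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def take(self, time: int, level: str) -> None:
        """Record a checkpoint of the given level."""
        if level not in ("1", "N"):
            raise ValueError("level must be '1' or 'N'")
        self.checkpoints.append(Checkpoint(time, level))

    def rollback_target(self, num_failures: int) -> Optional[Checkpoint]:
        """Return the most recent checkpoint usable for this failure count."""
        if num_failures < 1:
            raise ValueError("num_failures must be at least 1")
        if num_failures == 1:
            # Assumption: any checkpoint suffices for a single failure.
            usable = self.checkpoints
        else:
            # Multiple failures require rolling back to an N-checkpoint.
            usable = [c for c in self.checkpoints if c.level == "N"]
        # None means no usable checkpoint exists yet.
        return max(usable, key=lambda c: c.time, default=None)


# Example schedule: frequent 1-checkpoints, occasional N-checkpoints.
log = CheckpointLog()
for t, lvl in [(0, "N"), (10, "1"), (20, "1"), (30, "N"), (40, "1")]:
    log.take(t, lvl)

single = log.rollback_target(1)    # most recent checkpoint of either level
multiple = log.rollback_target(2)  # most recent N-checkpoint
```

With this schedule, a single failure rolls back to the checkpoint at time 40, while multiple failures roll back to the N-checkpoint at time 30. The more probable failure therefore loses less work, which is the trade-off the abstract argues for.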