Automated Analysis of Fault-Tolerance in Distributed Systems

Scott D. Stoller, Fred B. Schneider

doi:10.1007/s10703-005-1492-2

Abstract

This paper describes a method for automated analysis of fault-tolerance properties of distributed systems. The framework is based on ideas from stream-processing semantics for networks of processes and abstract interpretation of programs. The stream-processing model provides modularity and a clean algorithmic basis for the analysis. For efficiency, all aspects of a system's behavior can be approximated in the analysis, including values (the data transmitted in messages), multiplicities (the number of times each value is sent), and orderings (the order in which values are sent). The approximation mechanisms are based on abstract interpretation, symbolic computation, and partial orders. Approximations are essential to support abstraction from aspects of a system's behavior that do not directly impact its fault-tolerance. Another feature of our approach is that perturbations due to failures can be represented explicitly. This allows fault-tolerance requirements to be expressed as bounds on the acceptable perturbations to a system's behavior as a consequence of certain failures. This facilitates separation of fault-tolerance from other correctness requirements and sometimes enables more efficient analysis. The analysis has been implemented in a prototype tool.
