Units of Computation in Fault-Tolerant Distributed Systems

Mohan Ahuja,Shivakant Mishra

doi:10.1006/jpdc.1996.1277

Abstract

We develop a framework that helps in understanding a fault-tolerant distributed system and so aids in designing such systems. We illustrate the uses of the developed work in application areas such as checkpointing and recovery, phase termination detection, stable property detection, implementing membership protocols, debugging, and design of programming languages. We define a unit of computation, and refer to it as a molecule. A molecule has a well defined interface with other molecules. The smallest such unit—an indivisible molecule—is termed an atom. We show that any execution of a fault-tolerant distributed computation can be seen as an execution of molecules/atoms in a partial order, and such a view provides insights into understanding the computation, particularly for a fault-tolerant system where it is important to guarantee that a unit of computation is either completely executed or not at all and system designers need to reason about the states after execution of such units. Molecules are essentially a generalization of atomic actions.

Full Text