Efficient algorithms for reliability analysis of distributed computing systems

Min-Sheng Lin

doi:10.1016/s0020-0255(99)00003-1

Abstract

A distributed computing system is modeled as a collection of resources (e.g., processing elements, data files, and programs) interconnected via an arbitrary communication network and controlled by a distributed operating system. The distributed program reliability of such a system is the probability that a program which runs on multiple processing elements, and which needs to retrieve data files from other processing elements, executes successfully. This reliability depends on (1) the topology of the distributed computing system, (2) the reliability of the communication edges, (3) the distribution of data files and programs among the processing elements, and (4) the data files required to execute the program. Moreover, computing the reliability of a distributed computing system is #P-complete even when the system is restricted to a series-parallel, a 2-tree, a tree, or a star structure. This paper presents efficient algorithms for computing the reliability of a distributed program running on other restricted classes of networks.
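To make the definition of distributed program reliability concrete, the following is a minimal brute-force sketch in Python. It is not one of the algorithms of this paper. It enumerates every combination of working and failed edges, so it takes exponential time and is useful only for very small networks. The sketch assumes perfectly reliable nodes, statistically independent edge failures, and that execution succeeds when some node holding a copy of the program can reach, over working edges, at least one copy of every required file. The function and parameter names (`program_reliability`, `program_nodes`, `file_locations`, `required_files`) are hypothetical and chosen for illustration only.

```python
from itertools import product


def _reachable(start, adjacency):
    """Return the set of nodes reachable from `start` over working edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def program_reliability(edges, program_nodes, file_locations, required_files):
    """Brute-force distributed program reliability (illustrative sketch).

    edges          -- dict mapping an undirected edge (u, v) to its
                      probability of working
    program_nodes  -- nodes that hold a copy of the program
    file_locations -- dict mapping a file name to the set of nodes holding it
    required_files -- files the program needs in order to execute

    Assumptions (not taken from the paper): nodes never fail, and edges
    fail independently of one another.
    """
    edge_list = list(edges)
    total = 0.0
    # Enumerate all 2^|E| up/down states of the communication edges.
    for states in product((True, False), repeat=len(edge_list)):
        probability = 1.0
        adjacency = {}
        for (u, v), up in zip(edge_list, states):
            p = edges[(u, v)]
            probability *= p if up else 1.0 - p
            if up:
                adjacency.setdefault(u, set()).add(v)
                adjacency.setdefault(v, set()).add(u)
        # The state is a success if some program copy reaches every file.
        for start in program_nodes:
            component = _reachable(start, adjacency)
            if all(file_locations[f] & component for f in required_files):
                total += probability
                break
    return total


# Hypothetical example: a triangle network with edge reliability 0.9,
# the program on node 1, file "f1" on node 2, and file "f2" on node 3.
triangle = {(1, 2): 0.9, (2, 3): 0.9, (1, 3): 0.9}
reliability = program_reliability(
    triangle,
    program_nodes={1},
    file_locations={"f1": {2}, "f2": {3}},
    required_files={"f1", "f2"},
)
```

In this example all three nodes must be connected, so the result should agree with the closed form p^3 + 3p^2(1 - p) for p = 0.9. The sketch also makes visible the four factors listed above: the topology and edge reliabilities enter through `edges`, the distribution of files and programs through `program_nodes` and `file_locations`, and the files needed by the program through `required_files`.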
