Optimal fault-tolerant computing on multiprocessor systems

John Bruno, E. G. Coffman, Jr.

doi:10.1007/s002360050110

Abstract

Suppose $m \ge 2$ identical processors, each subject to random failures, are available for running a single job of given duration $\tau$. The failure law is operative only while a processor is active. To guard against the loss of accrued work due to a failure, checkpoints can be made, each requiring time $\delta$; a successful checkpoint saves the state of the computation, but failures can also occur during checkpoints. The problem is to determine how best to schedule checkpoints if the goal is to maximize the probability that the job finishes before all $m$ processors fail. We solve this problem first for $m=2$ and an exponential failure law. For given $\tau$ and $\delta$ we show how to determine an integer $k \ge 0$ and time intervals $I_1, \ldots, I_{k+1}$ such that an optimal procedure is to run the job on one processor, checkpointing at the end of each interval $I_j$, $j = 1, \ldots, k$, until either the job is done or a failure occurs. In the latter case, the remaining processor resumes the job starting in the state saved by the last successful checkpoint; the job then runs until it completes or until the second processor also fails. We give an explicit formula for the maximum achievable probability of completing the job for any fixed $k \ge 0$. An explicit result for $k_{opt}$, the optimum value of $k$, seems out of reach; however, we give upper and lower bounds on $k_{opt}$ that are remarkably tight; they show that only a few values of $k$ need to be tested in order to find $k_{opt}$. With the failure rate normalized to 1, we also derive the asymptotic estimate $$ k_{opt} - \sqrt{2 \tau / \delta} = O(1) \quad \text{as} \quad \delta \to 0, $$ and calculate conditional expected job completion times. For the more difficult problem with $m \ge 3$ processors, we formulate a computational approach based on a discretized model in which the failure law is the analogous geometric distribution. By proving a unimodality property of the optimal completion probability, we are able to describe a computation of this optimum that requires $O(m n \log n)$ time, where $n$ is the job running time. Several examples bring out behavioral details.
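
The $m=2$ procedure described above can be illustrated numerically. The following is a minimal sketch, not the paper's explicit formula: it evaluates the completion probability for a given list of intervals under one reading of the model (exponential failures at rate 1, the second processor running without further checkpoints). The function names `completion_probability` and `equal_intervals` are hypothetical, and equal-length intervals are a simplifying assumption; the abstract does not state that the optimal intervals are equal.

```python
import math


def completion_probability(tau, delta, intervals):
    """Probability that the job finishes before both processors fail.

    Assumed model (m = 2, failure rate normalized to 1):
      * processor 1 runs intervals I_1..I_{k+1}, checkpointing (cost delta)
        after each of the first k intervals;
      * if it fails, processor 2 restarts from the last successful
        checkpoint and runs the remaining work without checkpoints.
    """
    k = len(intervals) - 1
    if k < 0 or abs(sum(intervals) - tau) > 1e-9:
        raise ValueError("intervals must be non-empty and sum to tau")

    # Case 1: processor 1 survives the whole run, including k checkpoints.
    prob = math.exp(-(tau + k * delta))

    work_saved = 0.0   # work protected by the last successful checkpoint
    start = 0.0        # time at which that checkpoint completed
    for j, length in enumerate(intervals):
        # The window ends when the next checkpoint completes, or, for the
        # last interval, when the job itself completes.
        end = start + length + (delta if j < k else 0.0)
        p_fail_in_window = math.exp(-start) - math.exp(-end)
        # Case 2: processor 2 must survive the remaining work.
        prob += p_fail_in_window * math.exp(-(tau - work_saved))
        work_saved += length
        start = end
    return prob


def equal_intervals(tau, k):
    """k checkpoints splitting the job into k + 1 equal intervals (assumption)."""
    return [tau / (k + 1)] * (k + 1)


if __name__ == "__main__":
    tau, delta = 2.0, 0.01
    best_k = max(
        range(0, 60),
        key=lambda k: completion_probability(tau, delta, equal_intervals(tau, k)),
    )
    print("best k over equal-interval schedules:", best_k)
    print("sqrt(2 * tau / delta):", math.sqrt(2 * tau / delta))
```

Scanning a small range of $k$ in this way mirrors the observation that only a few values of $k$ need to be tested, and the printed $\sqrt{2\tau/\delta}$ gives a rough point of comparison for small $\delta$. Because the schedule is restricted to equal intervals, the maximizing $k$ found here need not coincide with $k_{opt}$.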
