On unreliable computing systems when heavy-tails appear as a result of the recovery procedure

Pierre M Fiorini, Lester Lipsky, Robert Sheahan

doi:10.1145/1101892.1101898

Abstract

For some computing systems, failure is rare enough that it can be ignored. In other systems, failure is so common that the way it is handled can have a significant impact on system performance. There are many different recovery schemes for tasks; however, they can be classified into three broad categories: 1) Resume: when a task fails, it knows exactly where it stopped and can continue from that point when allowed to resume (i.e., preemptive resume, prs); 2) Replace: when a task fails, the processor begins a brand new task when it later continues (i.e., preemptive repeat different, prd); and 3) Restart: when a task fails, it loses all work done to that point and must start anew upon continuing later (i.e., preemptive repeat identical, pri). In this paper, assuming a computing system is unreliable, we discuss how heavy-tail (hereafter referred to as power-tail, PT) distributions can appear in a job's task stream given the Restart recovery procedure. This is an important consideration since it is known that power-tails can lead to unstable systems [4]. We then demonstrate how to obtain performance and dependability measures for a class of computing systems comprised of P unreliable processors and a finite number of tasks N given the above recovery procedures.
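The effect of the Restart (pri) policy on completion times can be explored with a small simulation. The following is a minimal sketch, not the paper's model: it assumes exponentially distributed task lengths and times to failure, a single processor, and zero repair time. The function names and parameter values are hypothetical, chosen only for illustration. Under these assumptions a Resume (prs) task finishes in exactly its own length, so comparing the two empirical tails shows how much extra time Restart adds.

```python
import random


def restart_completion_time(task_len, fail_rate, rng):
    """Total time to finish one task under Restart (pri).

    Assumptions (not from the paper): failures arrive with exponentially
    distributed gaps of rate fail_rate, and repair takes zero time.
    Every failure discards all work, and the same task starts over.
    """
    total = 0.0
    while True:
        time_to_failure = rng.expovariate(fail_rate)
        if time_to_failure >= task_len:
            # The task ran to completion before the next failure.
            return total + task_len
        # Work done so far is lost; only the elapsed time is kept.
        total += time_to_failure


def simulate(n_tasks, task_rate, fail_rate, seed=0):
    """Return (task lengths, Restart completion times) for n_tasks tasks.

    Task lengths are drawn once per task (assumed exponential with rate
    task_rate). With zero repair time, the task length is also the
    completion time under Resume (prs).
    """
    rng = random.Random(seed)
    lengths = [rng.expovariate(task_rate) for _ in range(n_tasks)]
    totals = [restart_completion_time(length, fail_rate, rng) for length in lengths]
    return lengths, totals


def tail_fraction(samples, threshold):
    """Empirical estimate of P(X > threshold)."""
    return sum(1 for s in samples if s > threshold) / len(samples)


if __name__ == "__main__":
    lengths, totals = simulate(n_tasks=20_000, task_rate=1.0, fail_rate=1.0)
    print("threshold  P(resume > x)  P(restart > x)")
    for x in (10, 100, 1000):
        print(x, tail_fraction(lengths, x), tail_fraction(totals, x))
```

For a fixed task length L and failure rate β, the mean Restart completion time in this sketch is (e^{βL} - 1)/β, which grows exponentially in L. Long tasks are therefore penalized far more heavily than short ones, which gives some intuition for why the completion-time tail can be much heavier than the task-length tail.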

Full Text