Fault-Tolerant Computing: An Introduction and a Perspective

C.R. Kime

doi:10.1109/t-c.1975.224246

Abstract

Fault-tolerant computing has been defined as the ability to execute specified algorithms correctly regardless of hardware failures, total system flaws, or program fallacies [1]. To the extent that a system falls short of meeting the requirements of this definition, it can be labeled a partially fault-tolerant system [2]. Thus the definition of fault-tolerant computing provides a standard against which to measure all systems having a degree of fault tolerance. In particular, one can classify systems according to 1) the amount of manual intervention required in performing the three basic functions involved in fault tolerance, and 2) the class of faults covered by those functions: system validation, diagnosis, and masking or recovery. The word "fault" is used here inclusively to cover the failures, flaws, and fallacies of the original definition. The first function is involved in the design and production of the system hardware and software, while the last two functions are embodied in the system itself. Likewise, the first function is directed at handling faults arising from design and production errors, whereas the last two functions are aimed at faults due to random hardware failures.

Full Text