Teraflops supercomputer: architecture and validation of the fault tolerance mechanisms

C Constantinescu

doi:10.1109/12.869320

Journal: IEEE Transactions on Computers
Publication Date: Jan 1, 2000
Citations: 19

Abstract

Intel Corporation developed the Teraflops supercomputer for the US Department of Energy (DOE) as part of the Accelerated Strategic Computing Initiative (ASCI). This is the most powerful computing machine available today, performing over two trillion floating point operations per second with the aid of more than 9,000 Intel processors. The Teraflops machine employs complex hardware and software fault/error handling mechanisms for complying with DOE's reliability requirements. This paper gives a brief description of the system architecture and presents the validation of the fault tolerance mechanisms. Physical fault injection at the IC pin level was used for validation purposes. An original approach was developed for assessing signal sensitivity to transient faults and the effectiveness of the fault/error handling mechanisms. Dependency between fault/error detection coverage and fault duration was also determined. Fault injection experiments unveiled several malfunctions at the hardware, firmware, and software levels. The supercomputer performed according to the DOE requirements after corrective actions were implemented. The fault injection approach presented in this paper can be used for validation of any fault-tolerant or highly available computing system.
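
The dependency between detection coverage and fault duration mentioned in the abstract can be illustrated with a minimal sketch. It assumes each fault injection experiment is logged as a (fault duration, detected) pair and estimates coverage as the fraction of injected faults that were detected, with a Wilson score interval for each duration. The function name, the record layout, and the choice of interval are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict
from math import sqrt


def coverage_by_duration(records, z=1.96):
    """Estimate detection coverage per fault duration.

    records: iterable of (fault_duration, detected) pairs (hypothetical layout).
    Returns {duration: (coverage, lower, upper, n)}, where lower/upper are the
    bounds of a Wilson score interval at the confidence level implied by z.
    """
    counts = defaultdict(lambda: [0, 0])  # duration -> [detected, injected]
    for duration, detected in records:
        counts[duration][1] += 1
        if detected:
            counts[duration][0] += 1

    result = {}
    for duration, (d, n) in sorted(counts.items()):
        p = d / n  # point estimate: detected faults / injected faults
        denom = 1 + z * z / n
        centre = (p + z * z / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        result[duration] = (p, max(0.0, centre - half), min(1.0, centre + half), n)
    return result
```

Given a log of injection outcomes, calling `coverage_by_duration(log)` returns one entry per distinct fault duration, so coverage can be compared across short and long transients. With few injections per duration the intervals are wide, which is one reason to report them alongside the point estimate.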
