Fault detection and tolerance mechanisms for future 1000 core systems

Bernhard Fechner, Arne Garbade, Sebastian Weis, Theo Ungerer

doi:10.1109/hpcsim.2013.6641467

Abstract

The enormous growth in integration density makes it possible to build processors with more and more cores on a single die, but it also makes them orders of magnitude more vulnerable to faults caused by voltage fluctuations, radiation, process variations [4], and similar effects. Since this trend will continue, fault-tolerance mechanisms must be an essential part of future systems if computations are to be carried out reliably. Chip manufacturers have already taken measures to handle faults in current multi-core processors, such as error-correcting codes for buses and caches. With a huge number of cores, common strategies such as dual and triple modular redundant processing [5] become possible alongside massively parallel computing. Threaded dataflow execution models are one way to exploit the parallelism of future 1000-core systems. Current GPU architectures reflect that [3]. The side-effect-free execution of threads within the dataflow execution model not only provides massively parallel computational capacity but also enables simple and efficient rollback mechanisms [16]. In this paper, we describe fault detection and tolerance mechanisms investigated within the TERAFLUX EC project [17], which offers a solution to exploit the massive parallelism offered by dataflow architectures at all abstraction levels.

Full Text