Transient Fault Tolerance on Chip Multiprocessor Based on Dual and Triple Core Redundancy

Rui Gong,Zhiying Wang,Kui Dai

doi:10.1109/prdc.2008.40

Abstract

To address the increasing susceptibility of microprocessors to transient faults, many techniques have been proposed to exploit the core redundancy of chip multiprocessors (CMPs). But the inter-core communications become critical in these core redundancy based techniques. To reduce the inter-core communication bandwidth demand, two new approaches, dual core redundancy (DCR) and triple core redundancy (TCR), are proposed for fault tolerance in this paper. In DCR, only store instructions are compared before commit, so that the bandwidth demand can be largely reduced. And the fault recovery is achieved by context saving and recovery. While TCR applies triple modular redundancy (TMR) in the core level to efficiently exploit the core resources of CMPs for transient fault masking. In TCR, only the results of store instructions are compared to detect transient fault and reduce the inter-core communication bandwidth demand. Once detecting a single event upset (SEU), TCR can be reconfigured to execute with the two uncorrupted cores for fault detection.The experimental results demonstrate that compared to traditional transient fault recovery scheme CRTR, both DCR and TCR efficiently reduce inter-core bandwidth demand. DCR achieves transient fault recovery with reasonable performance overhead caused by context saving. TCR occupies more core resources and has the lowest performance overhead during normal execution.

Full Text