On-Demand Fault-Tolerant Loop Processing

Alexandru-Petru Tanase, Jürgen Teich, Frank Hannig

doi:10.1007/978-3-319-73909-0_5

Abstract

As the feature sizes of silicon devices continue to shrink, modern, complex systems become increasingly prone to errors, and appropriate fault tolerance measures are needed to counter this. In this chapter, we therefore propose new techniques that leverage the advantages of self-organizing computing paradigms such as invasive computing to implement fault tolerance adaptively on multiprocessor systems-on-chips (MPSoCs). We present new compile-time transformations that introduce modular redundancy into a loop program to protect it against soft errors. Our approach uses the abundance of processing elements (PEs) within a tightly coupled processor array (TCPA) to claim not just one region of the array, but two (dual modular redundancy, DMR) or three (triple modular redundancy, TMR) contiguous neighboring regions of PEs. At the source code level, the compiler realizes these replication schemes with a program transformation that (1) replicates a parallel loop program two or three times for DMR or TMR, respectively, and (2) introduces appropriate voting operations whose frequency and location may be chosen from three proposed variants. Which variant to choose depends, for example, on the error resilience needs of the application or on the expected soft error rates. Finally, we explore the tradeoffs among these variants with respect to performance overhead and error detection latency.
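To make the replicate-and-vote idea concrete, the following is a minimal software sketch in Python, not the chapter's compiler transformation or its TCPA mapping. The loop body `loop_body`, the helpers `compare_dmr`, `vote_tmr`, and `run_tmr`, and the `fault` injection parameter are all hypothetical names introduced for illustration. The sketch votes after every iteration, which is only one possible choice of voting frequency and location.

```python
def loop_body(a, x, i):
    # Hypothetical loop body: one iteration computes y[i] = a * x[i].
    return a * x[i]


def compare_dmr(r0, r1):
    # DMR: two replicas can detect a mismatch but cannot correct it.
    # Returns (value of replica 0, error_detected).
    return r0, r0 != r1


def vote_tmr(r0, r1, r2):
    # TMR: majority vote over three replicas.
    # Returns (majority value, error_detected).
    if r0 == r1 or r0 == r2:
        return r0, not (r0 == r1 == r2)
    if r1 == r2:
        return r1, True
    raise RuntimeError("no majority among replicas")


def run_tmr(a, x, fault=None):
    # Execute the loop three times per iteration and vote immediately.
    # fault: optional (replica_index, iteration) pair; at that point the
    # least significant bit of the replica's result is flipped to mimic
    # a soft error (an assumption made only for this demonstration).
    y = []
    detected = 0
    for i in range(len(x)):
        replicas = [loop_body(a, x, i) for _ in range(3)]
        if fault is not None and fault[1] == i:
            replicas[fault[0]] ^= 1
        value, err = vote_tmr(*replicas)
        detected += int(err)
        y.append(value)
    return y, detected


if __name__ == "__main__":
    print(run_tmr(3, [1, 2, 3]))                 # fault-free run
    print(run_tmr(3, [1, 2, 3], fault=(1, 2)))   # one injected bit flip
```

In this sketch a single corrupted replica is masked by the majority vote and counted as a detected error, whereas the DMR comparison can only flag the mismatch. Voting less often would reduce the overhead at the cost of a longer error detection latency, which is the tradeoff the abstract refers to.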

Full Text