Trading Fault Tolerance for Performance in AN Encoding

Norman A. Rink, Jeronimo Castrillon

doi:10.1145/3075564.3075565

Abstract

Increasing rates of transient hardware faults pose a problem for computing applications. Current and future trends are likely to exacerbate this problem. When a transient fault occurs during program execution, data in the output can become corrupted. The severity of output corruptions depends on the application domain. Hence, different applications require different levels of fault tolerance. We present an LLVM-based AN encoder that can equip programs with an error detection mechanism at configurable levels of rigor. Based on our AN encoder, the trade-off between fault tolerance and runtime overhead is analyzed. It is found that, by suitably configuring our AN encoder, the runtime overhead can be reduced from 9.9x to 2.1x. At the same time, however, the probability that a hardware fault in the CPU will result in silent data corruption rises from 0.007 to over 0.022. The same probability for memory faults increases from 0.009 to over 0.032. It is further demonstrated, by applying different configurations of our AN encoder to the components of an arithmetic expression interpreter, that having fine-grained control over levels of fault tolerance can be beneficial.
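The abstract does not spell out how AN encoding detects errors, so the following is a minimal sketch of the general scheme rather than the authors' LLVM-based encoder. In AN encoding, a value x is stored as the code word A*x for a fixed constant A, and a word is accepted only if it is a multiple of A. The constant `AN_A` and all function names below are hypothetical, chosen for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding constant. Any odd A > 1 detects every
 * single bit flip, because a flip changes the word by +/- 2^k,
 * and no power of two is divisible by an odd A > 1. */
#define AN_A 58659u

/* Encode a 32-bit value as a 64-bit code word A * x. */
static uint64_t an_encode(uint32_t x) {
    return (uint64_t)x * AN_A;
}

/* A word is valid only if it is a multiple of A. */
static bool an_check(uint64_t c) {
    return c % AN_A == 0;
}

/* Decode a code word; the caller is expected to pass a valid word. */
static uint32_t an_decode(uint64_t c) {
    assert(an_check(c));
    return (uint32_t)(c / AN_A);
}

/* Addition works directly on code words: A*x + A*y == A*(x + y). */
static uint64_t an_add(uint64_t a, uint64_t b) {
    return a + b;
}

/* Simulate a transient hardware fault by flipping bit k (0..63). */
static uint64_t an_flip_bit(uint64_t c, unsigned k) {
    return c ^ ((uint64_t)1 << k);
}
```

In this sketch, `an_check(an_flip_bit(an_encode(7), 3))` returns false, so the corrupted word is detected before it is decoded. Each check costs a modulo operation, and skipping checks on some operations is one way such a scheme could trade detection coverage for speed, which is the kind of trade-off the abstract quantifies.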

Full Text