A Cross-Layer Multicore Architecture to Tradeoff Program Accuracy and Resilience Overheads

Qingchuan Shi, Omer Khan, Henry Hoffmann

doi:10.1109/lca.2014.2365204

Abstract

To protect multicores from soft-error perturbations, resiliency schemes have been developed with high coverage but high power/performance overheads (~2x). We observe that not all soft-errors affect program correctness; some soft-errors only affect program accuracy, i.e., the program completes with certain acceptable deviations from the soft-error-free outcome. Thus, it is practical to improve processor efficiency by trading off resilience overheads with program accuracy. We propose the idea of declarative resilience that selectively applies resilience schemes to both crucial and non-crucial code, while ensuring program correctness. At the application level, crucial and non-crucial code is identified based on its impact on the program outcome. The hardware collaborates with software support to enable efficient resilience with 100 percent soft-error coverage. Only program accuracy is compromised in the worst-case scenario of a soft-error strike during non-crucial code execution. For a set of multithreaded benchmarks, declarative resilience improves completion time by an average of 21 percent over a state-of-the-art hardware resilience scheme that protects all executed code. Its performance overhead is ~1.38x over a multicore that does not support resilience.

Full Text