Abstract

Due to the system scaling, <italic xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">transient errors</i> caused by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for the current and upcoming exa-scale high-performance-computing (HPC) systems. Applications running on these systems are expected to experience transient errors more frequently than ever before, which will either lead them to generate incorrect outputs or cause them to crash. However, since such errors are still quite rare as compared to no-fault cases, desirable solutions call for low/no-overhead systems that do not compromise the performance under no-fault conditions and also allow very fast fault recovery to minimize downtime. In this article, we present <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> , a light-weight compiler-assisted resilience technique to quickly and accurately recover processes from transient-error-induced crashes. During the compilation of applications, <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> constructs a set of recovery kernels for crash-prone instructions. These recovery kernels are executed to repair the corrupted process states on-the-fly upon occurrences of errors, enabling applications to continue their executions instead of being terminated. When constructing recovery kernels, <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> exploits side effects introduced by induction variable based code optimization techniques based on loop unrolling and strength reduction to improve its recovery capability. To this end, two new code transformation passes are introduced to expose the side effects for resilience purposes. We evaluated <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> with 4 scientific workloads as well as the NPB benchmarks suite. During their normal execution, <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> incurs almost <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">zero</b> runtime overhead and a small, fixed <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">27MB</b> memory overhead. Meanwhile, <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> can recover on an average 83.55 percent of crash-causing errors within dozens of milliseconds with negligible downtime. We also evaluated <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> with parallel jobs running on 3072 cores and showed that <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> can successfully mask the impact of crash-causing errors by providing almost uninterrupted execution. Finally, we present our preliminary evaluation result for BLAS, which shows that <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> is capable of recovering failures in libraries with a very high coverage rate of 83 percent and negligible overheads. With such an effective recovery mechanism, <bold xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">IterPro</b> could tremendously mitigate the overheads and resource requirements of the resilience subsystem in future exa-scale systems.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call