DINO: Divergent node cloning for sustained redundancy in HPC

Arash Rezaei,Frank Mueller,Paul Hargrove,Eric Roman

doi:10.1016/j.jpdc.2017.06.010

Abstract

Complexity and scale of next generation HPC systems pose significant challenges in fault resilience methods such that contemporary checkpoint/restart (C/R) methods that address fail-stop behavior may be insufficient. Redundant computing has been proposed as an alternative at extreme scale. Triple redundancy has an advantage over C/R in that it can also detect silent data corruption (SDC) and then correct results via voting. However, current redundant computing approaches do not repair failed or corrupted replicas. Consequently, SDCs can no longer be detected after a replica failure since the system has been degraded to dual redundancy without voting capability. Hence, a job may have to be aborted if voting uncovers mismatching results between the remaining two replicas. And while replicas are logically equivalent, they may have divergent runtime states during job execution, which presents a challenge to simply creating new replicas dynamically.This problem is addressed by, DIvergent NOde cloning (DINO), a redundant execution environment that quickly recovers from hard failures. DINO consists of a novel node cloning service integrated into the MPI runtime system that solves the problem of consolidating divergent states among replicas on-the-fly. With DINO, after degradation to dual redundancy, a good replica can be quickly cloned so that triple redundancy is restored. We present experimental results over 9 NAS Parallel Benchmarks (NPB), Sweep3D and LULESH. Results confirm the applicability of the approach and the correctness of the recovery process and indicate that DINO can recover from failures nearly instantly. The cloning overhead depends on the process image size that needs to be transferred between source and destination of the clone operation and varies between 5.60 to 90.48 s. Simulation results with our model show that dual redundancy with DINO recovery always outperforms 2x and surpasses 3x redundancy on up to 1 million nodes. To the best of our knowledge, the design and implementation for repairing failed replicas in redundant MPI computing is unprecedented.

Full Text