A Case for Adaptive Redundancy for HPC Resilience

Saurabh Hukerikar, Robert F. Lucas, Pedro C. Diniz

doi:10.1007/978-3-642-54420-0_67

Abstract

Redundancy in both space and time has been widely used to detect, and in some cases correct, errors in High Performance Computing (HPC) systems. With the HPC community seeking exascale-class supercomputers by the end of the decade, unrealistic expectations for correct system behavior will result in exorbitant costs in terms of performance lost and energy expended. Resilience strategies will need to find a balance between fault coverage and the overheads incurred. In this work, we propose an adaptive approach that factors in application-level knowledge together with runtime inference about the fault tolerance state of the system to dynamically enable redundant multithreading (RMT). Our approach is based on simple programming language extensions, tightly integrated with a compiler infrastructure and a runtime framework that enables managing the performance overheads of redundant computation.
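As a rough illustration of the idea of enabling redundant execution only when it is warranted, the following minimal Python sketch duplicates a computation and compares the two results only if a region is marked critical and a runtime fault-rate estimate exceeds a threshold. This is not the authors' implementation, which relies on language extensions, a compiler infrastructure, and a runtime framework. The names `run_adaptive`, `est_fault_rate`, `threshold`, `critical`, and `RedundancyMismatch` are hypothetical, and the policy is an assumption made for illustration. Python threads are used only to mirror the leading/trailing structure of redundant multithreading and do not model its real performance behavior.

```python
from concurrent.futures import ThreadPoolExecutor


class RedundancyMismatch(Exception):
    """Raised when the two redundant executions disagree (error detected)."""


def run_adaptive(fn, args, est_fault_rate, threshold, critical=True):
    """Run fn(*args), duplicating it only when the (assumed) policy asks for it.

    Hypothetical policy: duplicate when the programmer marked the region as
    critical AND the runtime's fault-rate estimate exceeds the threshold.
    Returns (result, was_duplicated).
    """
    if not (critical and est_fault_rate > threshold):
        # Single execution: no redundancy overhead is paid.
        return fn(*args), False

    # Redundant execution: run two copies and compare their outputs.
    with ThreadPoolExecutor(max_workers=2) as pool:
        leading = pool.submit(fn, *args)
        trailing = pool.submit(fn, *args)
        a, b = leading.result(), trailing.result()

    if a != b:
        raise RedundancyMismatch(f"{a!r} != {b!r}")
    return a, True


def dot(x, y):
    """Example workload: dot product of two equal-length sequences."""
    return sum(xi * yi for xi, yi in zip(x, y))
```

For example, `run_adaptive(dot, ([1, 2, 3], [4, 5, 6]), est_fault_rate=1e-3, threshold=1e-4)` runs `dot` twice and compares the results, while the same call with `est_fault_rate=1e-6` runs it once. Comparison alone only detects a mismatch; correcting it would need a further step such as re-execution or a third copy with voting.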

Similar Papers

Design of robust scheduling methodologies for high performance computing
01 Jan 2019

Code Modernization Tools for Assisting Users in Migrating to Future Generations of Supercomputers
Ritu Arora ... Lars Koesterke
01 Jan 2017

Perspectives of China’s HPC system development: a view from the 2009 China HPC TOP100 list
Yunquan Zhang ... Guoxing Yuan
Frontiers of Computer Science in China | VOL. 4
04 Nov 2010

Multi-node Power/Performance Modeling for HPC System
Sangwoo Han ... Eui-Young Chung
01 Jun 2019
