Rolex: resilience-oriented language extensions for extreme-scale systems

Saurabh Hukerikar, Robert F Lucas

doi:10.1007/s11227-016-1752-5

Abstract

Future exascale high-performance computing (HPC) systems will be constructed from VLSI devices that are less reliable than those used today, and faults will become the norm, not the exception. This will pose significant problems for system designers and programmers, who for half a century have enjoyed an execution model that assumed correct behaviour by the underlying computing system. The mean time to failure of the system scales inversely with the number of components in the system; therefore, faults and the resulting system-level failures will increase as systems scale in the number of processor cores and memory modules used. However, not every detected error need cause catastrophic failure. Many HPC applications are inherently fault resilient, yet the application programmers who have this knowledge lack mechanisms to convey it to the system. In this paper, we present new Resilience Oriented Language Extensions (Rolex), which facilitate the incorporation of fault resilience as an intrinsic property of the application code. We describe the syntax and semantics of the language extensions as well as the implementation of the supporting compiler infrastructure and runtime system. Our experiments show that an approach that leverages the programmer’s insight to reason about the context and significance of faults to the application outcome significantly improves the probability that an application runs to a successful conclusion.
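The inverse scaling of mean time to failure (MTTF) with component count mentioned in the abstract can be illustrated with a minimal sketch. It assumes a simple series model that is not taken from the paper: components are identical, fail independently at a constant rate, and any single component failure counts as a system failure. The function name and the numbers used are hypothetical and chosen only for illustration.

```python
def system_mttf_hours(component_mttf_hours: float, num_components: int) -> float:
    """Estimate system MTTF under an assumed series model.

    Assumption (not from the paper): identical, independent components with
    constant failure rates, where one component failure fails the system.
    Failure rates then add up, so system MTTF = component MTTF / N.
    """
    if component_mttf_hours <= 0:
        raise ValueError("component_mttf_hours must be positive")
    if num_components <= 0:
        raise ValueError("num_components must be positive")
    return component_mttf_hours / num_components


if __name__ == "__main__":
    # Hypothetical figures: a 5-year component MTTF (43,800 hours).
    component_mttf = 5 * 365 * 24
    for n in (1_000, 10_000, 100_000):
        hours = system_mttf_hours(component_mttf, n)
        print(f"{n:>7} components: system MTTF of about {hours:.3f} hours")
```

Under these assumptions, growing the component count by a factor of ten cuts the expected time between system failures by the same factor, which is why failures become routine at extreme scale.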


Journal: The Journal of Supercomputing
Publication Date: May 26, 2016
Citations: 15

Similar Papers

HPC Process and Optimal Network Device Affinitization
Ravindra Babu Ganapathi ... Russell W Mcguire
IEEE Transactions on Multi-Scale Computing Systems | VOL. 4
01 Oct 2018

Bridging the Divide Between HPC and Commodity System Software
John R Lange
15 Jun 2015

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading
Saurabh Hukerikar ... Robert F Lucas
International Journal of Parallel Programming | VOL. 46
11 Feb 2017

Cost-benefit analysis of high performance computing infrastructures
Amril Nazir ... Soren-Aksel Sorensen
01 Dec 2010
