Abstract

As the complexity and node counts of large-scale high-performance computing (HPC) systems have grown, the probability of applications experiencing runtime failures has increased significantly. Projections indicate that exascale-sized systems are likely to operate with a mean time between failures (MTBF) of as little as a few minutes. Several strategies have been proposed in recent years for enabling systems of these extreme sizes to be resilient against failures. This work provides a comparison of four state-of-the-art HPC resilience protocols that are being considered for use in exascale systems. We explore the behavior of each resilience protocol under the simulated execution of a diverse set of applications and study the performance degradation that a large-scale system experiences from the overhead associated with each resilience protocol, as well as from the recomputation needed to recover when a failure occurs. Using the results from these analyses, we examine how resource management on exascale systems can be improved by allowing the system to select the optimal resilience protocol for each application based on its execution characteristics, and by giving the system resource manager the ability to make scheduling decisions that are "resilience aware" through the use of more accurate execution time predictions.
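As a rough illustration of the "resilience aware" selection idea described above, the sketch below estimates the expected wall-clock time of a job under several candidate protocols and picks the cheapest one. It assumes a simple coordinated checkpoint/restart cost model with Young's first-order approximation for the checkpoint interval; the protocol names, cost figures, and function names are hypothetical and are not taken from the paper.

```python
import math


def young_interval(ckpt_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def predicted_runtime(work, ckpt_cost, restart_cost, mtbf):
    """Predict wall-clock time for `work` seconds of useful computation under a
    first-order checkpoint/restart model with exponentially distributed failures."""
    tau = young_interval(ckpt_cost, mtbf)
    # Waste fraction: checkpoint overhead plus expected rework and restart per failure.
    waste = ckpt_cost / tau + (tau / 2.0 + restart_cost) / mtbf
    return work * (1.0 + waste)


def pick_protocol(work, mtbf, protocols):
    """Return the protocol whose predicted runtime is lowest for this job.

    `protocols` maps a protocol name to (checkpoint_cost_s, restart_cost_s).
    """
    return min(
        protocols,
        key=lambda name: predicted_runtime(work, *protocols[name], mtbf),
    )


# Hypothetical per-protocol costs (seconds) for one application on one system.
protocols = {
    "global_checkpoint": (600.0, 900.0),
    "local_multilevel": (60.0, 300.0),
    "uncoordinated_log": (20.0, 1200.0),
}

# An 8-hour job on a system with a 30-minute MTBF.
print(pick_protocol(work=8 * 3600.0, mtbf=1800.0, protocols=protocols))
```

In the same spirit, the predicted runtimes themselves (rather than the failure-free estimates) could be fed to the resource manager so that its scheduling decisions account for per-protocol resilience overhead.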
