Understanding and Exploiting Spatial Properties of System Failures on Extreme-Scale HPC Systems

Saurabh Gupta, James Rogers, Don Maxwell, Christopher Jantzi, Devesh Tiwari

doi:10.1109/dsn.2015.52

Abstract

As we approach exascale, scientific simulations are expected to experience more interruptions due to increased system failures. Designing better HPC resilience techniques requires understanding the key characteristics of system failures on these systems. While the temporal properties of system failures on HPC systems have been well investigated, there is limited understanding of the spatial characteristics of system failures and their impact on resilience mechanisms. Therefore, we examine the spatial characteristics and behavior of system failures. We investigate the interaction between the spatial and temporal characteristics of failures and its implications for system operations and resilience mechanisms on large-scale HPC systems. We show that system failures exhibit "spatial locality" at different granularities in the system, study the impact of different failure types, and investigate the correlation among different failure types. Finally, we propose a novel scheme that exploits the spatial locality in failures to improve application and system performance. Our evaluation shows that the proposed scheme significantly improves system performance on a dynamic, production-level HPC system.
