System-wide trade-off modeling of performance, power, and resilience on petascale systems

Li Yu, Zhiling Lan, Zhou Zhou, Yuping Fan, Michael E Papka

doi:10.1007/s11227-018-2368-8

Abstract

While performance remains a major objective in high-performance computing (HPC), future systems will have to deliver the desired performance under both reliability and energy constraints. Although a number of resilience methods and power management techniques have been proposed to address reliability and energy concerns, the trade-offs among performance, power, and resilience are not well understood, especially in HPC systems of unprecedented scale and complexity. In this work, we present a co-modeling mechanism named TOPPER (system-wide Trade-Off modeling for Performance, PowEr, and Resilience). TOPPER is built with colored Petri nets, which allow us to capture the dynamic, complicated interactions and dependencies among factors such as workload characteristics, hardware reliability, and runtime system operation on a petascale machine. Using system traces collected from a production supercomputer, we conducted a series of experiments to analyze various resilience methods, power capping techniques, and job characteristics in terms of system-wide performance and energy consumption. Our results provide insights into performance–power–resilience trade-offs on HPC systems.

Journal: The Journal of Supercomputing
Publication Date: Apr 11, 2018
Citations: 9

Similar Papers

State-of-the-Art Power Management Techniques
Maaz Ahmed ... Waseem Ahmed
20 Sep 2021

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale (V.2.0)
Christian Engelmann ... Rizwan Ashraf
16 Dec 2022

Topology-Aware Event Sequence Mining for Understanding HPC System Behavior and Detecting Anomalies
Zongze Li ... Song Fu
01 Aug 2019

Design of robust scheduling methodologies for high performance computing
01 Jan 2019