Abstract
While performance remains a major objective in the field of high-performance computing (HPC), future systems will have to deliver the desired performance under both reliability and energy constraints. Although a number of resilience methods and power management techniques have been presented to address reliability and energy concerns, the trade-offs among performance, power, and resilience are not well understood, especially in HPC systems of unprecedented scale and complexity. In this work, we present a co-modeling mechanism named TOPPER (system-wide Trade-Off modeling for Performance, PowEr, and Resilience). TOPPER is built with colored Petri nets, which allow us to capture the dynamic, complicated interactions and dependencies among different factors, such as workload characteristics, hardware reliability, and runtime system operation, on a petascale machine. Using system traces collected from a production supercomputer, we conducted a series of experiments to analyze various resilience methods, power capping techniques, and job characteristics in terms of system-wide performance and energy consumption. Our results provide interesting insights regarding performance–power–resilience trade-offs on HPC systems.