Abstract

Large-scale High Performance Computing (HPC) systems continue to be designed and built to push performance beyond the petascale, using monolithic, cluster, and distributed architectures and emerging multi-core Central Processing Unit (CPU) technologies. As these machines grow, so too do the size and variety of the applications that run on them. Yet power management and interconnect performance are of great and mounting concern, and to date the interactions among HPC subsystems, and their relationship to power efficiency, remain poorly understood. Furthermore, Executive Order 13423, issued in January 2007 to ensure that Federal agencies operate in an environmentally, economically, and fiscally sound manner, mandates a 30% reduction in the energy intensity (MBTUs per square foot) of government facilities over the FY06-15 timeframe, using FY03 as a baseline. Two major drawbacks hinder the ability of HPC systems to sustain consistent run time and energy-efficient performance: (1) major subsystems interact with each other, often making application run time and energy consumption unpredictable, and (2) the increased power density of these machines complicates the space, power, and cooling problem, resulting in partial or full system downtime and further exacerbating run-time unpredictability. We believe that one fundamental reason for these limitations is the operational isolation of loosely coupled subsystems. While developing subsystems in isolation has been the dominant model for decades, it is inherently unsuitable for ensuring consistent and sustainable system-wide performance. We propose that the collection of HPC subsystems, including the set of running applications, must be collaborative in nature; an HPC system's full potential is limited when subsystems operate in isolation and act autonomously to improve only their own performance.
This paper describes an approach that uses “Application-Level Behavioral Attribute Driven Techniques” to characterize HPC subsystem interactions as meaningful metrics and correlates. These can serve as inputs to algorithms that control large-scale behaviors (job schedulers, routers, and HVAC systems) as well as smaller-scale behaviors such as CPU frequency and voltage scaling, with the aim of improving run time and energy efficiency and helping to satisfy Executive Order 13423.
