Abstract
This paper examines energy management in a heterogeneous processor consisting of an integrated CPU–GPU for high-performance computing (HPC) applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types – a new and less understood problem. We examine the intra-node CPU–GPU frequency sensitivity of HPC applications on tightly coupled CPU–GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU–GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves measured average energy-delay squared (ED²) product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
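For readers unfamiliar with the metric, the energy-delay squared (ED²) product is conventionally energy multiplied by the square of execution time, so a lower value is better and performance loss is penalized more heavily than energy use. The following is a minimal sketch of how a relative ED² improvement could be computed. The function names and the sample numbers are hypothetical illustrations, not values or code from the paper.

```python
def ed2_product(energy_joules: float, delay_seconds: float) -> float:
    """Return the energy-delay squared product E * D^2 (lower is better)."""
    if energy_joules < 0 or delay_seconds < 0:
        raise ValueError("energy and delay must be non-negative")
    return energy_joules * delay_seconds ** 2


def ed2_improvement(base_energy: float, base_delay: float,
                    new_energy: float, new_delay: float) -> float:
    """Return the fractional ED^2 reduction of a new run relative to a baseline."""
    base = ed2_product(base_energy, base_delay)
    if base == 0:
        raise ValueError("baseline ED^2 must be non-zero")
    new = ed2_product(new_energy, new_delay)
    return (base - new) / base


# Hypothetical example: 25% less energy at the cost of a 2% longer runtime.
gain = ed2_improvement(base_energy=100.0, base_delay=10.0,
                       new_energy=75.0, new_delay=10.2)
print(f"ED^2 improvement: {gain:.1%}")
```

Because delay is squared, the 2% slowdown in this made-up example removes part of the 25% energy saving, which is why a scheme has to keep performance loss small to score well on this metric.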
Highlights
Efficient energy management is central to the effective operation of modern processors in platforms from mobile to data centers and high-performance computing (HPC) machines
We evaluated a subset of benchmarks (S3D, Sort, Stencil2D, Breadth-First Search (BFS)) from the Scalable Heterogeneous Computing (SHOC) benchmark suite [13] that represents a large portion of scientific code found in HPC applications
This paper proposed and implemented a set of techniques to improve the energy efficiency of integrated CPU–GPU (graphics processing unit) processors
Summary
Efficient energy management is central to the effective operation of modern processors in platforms from mobile to data centers and high-performance computing (HPC) machines. Driven in part by demand for energy efficiency, we have seen the emergence of such processors with attached graphics processing units (GPUs) acting as accelerators. The AMD A-Series processor contains two out-of-order dual-core CPU compute units (CUs, referred to as Piledriver modules) and a GPU. The GPU consists of 384 AMD Radeon™ cores, each capable of one single-precision fused multiply-add computation (FMAC) operation per cycle (the methodology and techniques in this paper are applicable to processors that support double-precision). More details on the AMD A-Series processor can be found in [32].
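The GPU description above implies a simple peak-throughput estimate: cores times operations per cycle times clock frequency. The sketch below works through that arithmetic, assuming the common convention that one FMAC counts as two floating-point operations (a multiply and an add). The function name is hypothetical, and the clock frequency is an illustrative input, since the text above does not state the GPU clock.

```python
def peak_sp_gflops(num_cores: int, clock_hz: float,
                   flops_per_fmac: int = 2) -> float:
    """Estimate peak single-precision GFLOPS for a GPU whose cores each
    retire one FMAC per cycle. Counting an FMAC as 2 FLOPs is a convention,
    not a figure taken from the source text."""
    if num_cores < 0 or clock_hz < 0:
        raise ValueError("core count and clock must be non-negative")
    return num_cores * flops_per_fmac * clock_hz / 1e9


# 384 cores comes from the text; 800 MHz is an assumed, illustrative clock.
estimate = peak_sp_gflops(num_cores=384, clock_hz=800e6)
print(f"Peak single-precision throughput: {estimate:.1f} GFLOPS")
```

Since the estimate scales linearly with clock frequency, it also shows the upper bound on how much raw GPU throughput is given up when frequency is lowered to save energy.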