Abstract
We present a unifying approach to monitoring and analyzing the metrics that are crucial to understanding the operational characteristics of HPC systems at different levels. Increases in the performance of HPC-scale processors have been closely followed by increases in processor power draw and in the scale of HPC systems. Consequently, the relationship between the thermal and power characteristics of the system, from the processor level to the cluster level, is becoming more complex. Our monitoring framework brings together operational metrics collected by hardware and software monitoring components at the HPC cluster level and the subsystem component level to enable a comprehensive analysis of these characteristics. We demonstrate the effectiveness of this unified monitoring capability through a comparative study of the efficiency of traditional air cooling and a liquid-cooling retrofit on our large-scale HPC system. Using the framework, we show, for the first time at our facility, that the liquid-cooled HPC system achieves significantly lower and more stable ambient temperatures in both the temporal and spatial dimensions, lower temperature disparity across subsystem components, and better system power efficiency than the air-cooled HPC system.