Abstract

As the scale of High-Performance Computing (HPC) clusters continues to grow, their increasing failure rates and energy consumption levels are emerging as serious design concerns. Efficiently running systems at such large scales critically relies on deploying effective, practical methods for fault tolerance while having a good understanding of their respective performance and energy overheads. The most commonly used fault tolerance method is checkpoint/restart. Checkpoint scheduling policies, however, have traditionally been optimized and analyzed from one angle: application performance. In this work, we provide an extensive analysis of the performance, energy, and I/O costs associated with a wide array of checkpointing policies. We consider practical deployment issues and show that simple formulas can be used to accurately estimate wasted work in a system. We propose methods to optimize checkpoint scheduling for energy savings and evaluate the runtime-optimized and energy-optimized policies using simulations based on failure logs from 10 production HPC clusters. Our results show ample room for achieving high-quality energy/performance tradeoffs when using methods that exploit characteristics of real-world failures. We also analyze the impact of energy-optimized checkpointing on the storage subsystem and identify policies that are optimal for I/O savings.
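As a concrete illustration of the kind of "simple formula" the abstract refers to (this sketch is not taken from the paper itself), the classical Young/Daly first-order approximation relates the checkpoint interval, the checkpoint cost C, and the mean time between failures (MTBF) to the expected fraction of wasted work. The parameter values below are hypothetical and serve only to show how such an estimate might be computed.

```python
import math

def optimal_checkpoint_interval(checkpoint_cost, mtbf):
    """Young/Daly first-order approximation of the checkpoint interval
    that minimizes expected wasted work."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def wasted_work_fraction(interval, checkpoint_cost, mtbf):
    """First-order estimate of the fraction of time lost to writing
    checkpoints plus re-executing work after a failure."""
    checkpoint_overhead = checkpoint_cost / interval  # time spent checkpointing
    rework = interval / (2.0 * mtbf)                  # expected lost work per failure
    return checkpoint_overhead + rework

# Hypothetical values: 10-minute checkpoints, 24-hour system MTBF.
C, M = 600.0, 24 * 3600.0
tau = optimal_checkpoint_interval(C, M)
print(f"optimal interval: {tau / 60:.1f} min, "
      f"estimated wasted work: {100 * wasted_work_fraction(tau, C, M):.1f}%")
```

With these assumed parameters the estimate comes out to roughly a 170-minute interval and about 12% wasted work; the paper's own analysis and energy-optimized policies are evaluated against failure logs from production clusters rather than such idealized assumptions.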
