Checkpointing Strategies to Tolerate Non-Memoryless Failures on HPC Platforms

Anne Benoit, Lucas Perotin, Yves Robert, Frédéric Vivien

doi:10.1145/3624560


Open Access

https://doi.org/10.1145/3624560


Abstract

This article studies checkpointing strategies for parallel applications subject to failures. The optimal strategy to minimize total execution time, or makespan, is well known when failure inter-arrival times (IATs) obey an Exponential distribution, but it is unknown for non-memoryless failure distributions. We explain why the latter fact is misunderstood in recent literature. We propose a general strategy that maximizes the expected efficiency until the next failure, and we show that this strategy achieves an asymptotically optimal makespan, thereby establishing the first optimality result for arbitrary failure distributions. Through extensive simulations, we show that the new strategy is always at least as good as the Young/Daly strategy for various failure distributions. For distributions with high infant mortality (such as LogNormal with shape parameter k = 2.51 or Weibull with shape parameter 0.5), the execution time is divided by a factor of 1.9 on average, and by up to a factor of 4.2 for recently deployed platforms.
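For orientation, the Young/Daly baseline mentioned above can be sketched in a few lines. This is the classical first-order approximation for Exponential failures, W = sqrt(2 * mu * C), where mu is the platform mean time between failures and C is the checkpoint cost. It is not the new strategy proposed in the article. The function names and the numeric values below are illustrative assumptions, not taken from the article.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """Classical first-order Young/Daly period: W = sqrt(2 * mu * C).

    mtbf            -- platform mean time between failures (mu), in seconds
    checkpoint_cost -- time to write one checkpoint (C), in seconds
    Assumes Exponential (memoryless) failure inter-arrival times.
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def first_order_waste(period: float, mtbf: float, checkpoint_cost: float) -> float:
    """First-order fraction of time lost: C/W (checkpoints) + W/(2*mu) (re-execution)."""
    return checkpoint_cost / period + period / (2.0 * mtbf)


# Hypothetical platform: 2-hour MTBF, 100-second checkpoints.
mu, c = 7200.0, 100.0
w = young_daly_period(mu, c)
print(f"period = {w:.0f} s, first-order waste = {first_order_waste(w, mu, c):.3f}")
```

Under these assumed values, the period balances the two sources of waste: checkpointing more often raises C/W, while checkpointing less often raises the expected re-execution term W/(2*mu). The approximation relies on the memoryless property, which is why it does not carry over directly to LogNormal or Weibull failures.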

Full Text