Abstract

This article considers the best checkpointing control realizable in real-world systems, whose mean time between failures (MTBF) often fluctuates. The control scheme, called “CHORE” (checkpointing overhead and rework equated), equates the aggregate checkpointing overhead over an activity sequence of interest (θ) with the expected amount of rework after a failure recovery, where θ starts when execution resumes after a failure recovery and ends after the restore from the following failure. CHORE lets its inter-checkpoint intervals in θ follow a predetermined sequence independent of MTBF in pursuit of performance optimality, and it is shown analytically to keep the overall execution time overhead upper-bounded. When failure occurrences are tracked during job execution for real-time MTBF estimation, an enhanced CHORE (dubbed En-CHORE) is obtained; it lowers checkpointing overhead by skipping certain checkpoints at the beginning of each θ before taking checkpoints at the most desirable inter-checkpoint intervals, which are determined on the fly. En-CHORE can outperform optimal checkpointing (which follows a fixed inter-checkpoint interval optimized for one constant global MTBF known a priori) both under synthetic random failures whose local MTBF fluctuates markedly and under real failure traces of 22 HPC systems (whose failure rates fluctuate over their trace time spans).
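
The contrast between a fixed interval tuned to a known MTBF and an MTBF-independent interval sequence can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the article: the fixed-interval baseline uses Young's well-known first-order approximation, and the `equated_intervals` rule is only a toy reading of the equating idea, in which the expected rework is assumed to be half of the current interval and is set equal to the checkpointing cost accumulated so far in θ. The function names and the constant per-checkpoint cost are hypothetical, and the article's actual interval sequence may differ.

```python
import math


def fixed_interval(checkpoint_cost, mtbf):
    """Fixed inter-checkpoint interval from Young's first-order
    approximation, sqrt(2 * C * MTBF). Requires the MTBF in advance."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def equated_intervals(checkpoint_cost, count):
    """Toy MTBF-independent interval sequence (illustrative assumption).

    The k-th interval is chosen so that the assumed expected rework
    (half of that interval) equals the aggregate cost of the k
    checkpoints taken so far: interval_k / 2 = k * C.
    """
    return [2.0 * k * checkpoint_cost for k in range(1, count + 1)]


if __name__ == "__main__":
    cost = 2.0  # hypothetical time to write one checkpoint
    print(fixed_interval(cost, mtbf=100.0))
    print(equated_intervals(cost, count=4))
```

Under these assumptions the toy sequence grows with each checkpoint taken since the last recovery and never consults the MTBF, whereas the fixed interval must be recomputed whenever the assumed MTBF changes.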
