Abstract

SummaryCoordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large‐scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next‐generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call