The unexpected virtue of almost: Exploiting MPI collective operations to approximately coordinate checkpoints

Scott Levy, Kurt B Ferreira, Patrick Widener

doi:10.1002/cpe.4890


Open Access

https://doi.org/10.1002/cpe.4890


Abstract

Coordinated checkpoint/restart is currently the dominant approach to mitigating the impact of failures on important scientific applications running on large‐scale distributed systems. However, there is widespread evidence that coordinated checkpointing may no longer be viable on next‐generation systems. Uncoordinated checkpoint/restart attempts to address the shortcomings of coordinated checkpoint/restart by allowing application processes to checkpoint their state independently. However, eliminating coordination may significantly degrade application performance. In this paper, we propose an approach that leverages existing coordination in important scientific applications to approximately coordinate checkpoints. Specifically, we propose to extend MPI implementations to force checkpoints to occur immediately after the completion of a collective operation. We evaluate the performance implications of this approach using an existing validated simulation framework. Our results demonstrate that approximately coordinated checkpointing can significantly improve application performance relative to totally uncoordinated checkpointing. We also show that forcing checkpoints to occur following a collective operation has a small impact on the nominal checkpoint interval for several important workloads. As a whole, the results presented in this paper demonstrate that approximately coordinated checkpointing may provide significant performance benefits without significantly increasing the cost of failure recovery.
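The timing rule in the abstract (defer each checkpoint until a collective operation completes) can be illustrated with a small sketch. This is not the authors' simulation framework and it uses no MPI API. The function names, the single-process view, the zero-cost checkpoints, and the choice to restart the interval timer at each checkpoint are all assumptions made for illustration.

```python
from bisect import bisect_left


def approx_coordinated_checkpoints(collective_end_times, nominal_interval):
    """Return checkpoint times when each checkpoint is deferred to the first
    collective completion at or after its nominal due time.

    Assumptions (hypothetical model, not from the paper): time starts at 0,
    a checkpoint takes no time, and the interval timer restarts at the moment
    a checkpoint is taken.
    """
    times = sorted(collective_end_times)
    checkpoints = []
    due = nominal_interval
    while True:
        # First collective that completes at or after the due time.
        i = bisect_left(times, due)
        if i == len(times):
            break  # no further collective to attach a checkpoint to
        taken_at = times[i]
        checkpoints.append(taken_at)
        due = taken_at + nominal_interval
    return checkpoints


def effective_interval(checkpoints):
    """Mean spacing between checkpoints, measured from time 0."""
    if not checkpoints:
        return None
    return checkpoints[-1] / len(checkpoints)


if __name__ == "__main__":
    # Collectives complete every 3 time units; nominal interval is 10.
    ends = [3 * k for k in range(1, 11)]
    taken = approx_coordinated_checkpoints(ends, 10)
    print(taken, effective_interval(taken))
```

In this toy setting the gap between the effective and the nominal interval is bounded by the spacing between collectives, which is one way to read the abstract's claim that frequent collectives leave the nominal checkpoint interval nearly unchanged.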


Journal: Concurrency and Computation: Practice and Experience	Publication Date: Sep 9, 2018
Citations: 1	License type: publisher-specific, author manuscript


Similar Papers

Exploiting hierarchy in parallel computer networks to optimize collective operation performance
N.T Karonis ... I Foster
04 Feb 2000

Model-based selection of optimal MPI broadcast algorithms for multi-core clusters
Emin Nuriyev ... Alexey Lastovetsky
Journal of Parallel and Distributed Computing | VOL. 165
23 Mar 2022

Implementation and performance analysis of non-blocking collective operations for MPI
Torsten Hoefler ... Wolfgang Rehm
10 Nov 2007

Autotuning MPI Collectives using Performance Guidelines
Sascha Hunold ... Alexandra Carpen-Amarie
28 Jan 2018
