EReinit: Scalable and efficient fault‐tolerance for bulk‐synchronous MPI applications

Sourav Chakraborty, Martin Schulz, Kathryn Mohror, Dhabaleswar K Panda, Ignacio Laguna, Hari Subramoni, Murali Emani

doi:10.1002/cpe.4863

Abstract

Scientists from many different fields have been developing bulk‐synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC systems, providing efficient fault‐tolerance mechanisms for this class of applications is paramount. The global‐restart model has been proposed to decrease the time of failure recovery in bulk‐synchronous applications by allowing a fast reinitialization of MPI. However, the current implementations of this model have several drawbacks: they lack efficiency; their scalability has not been shown; and they require the use of the MPI profiling interface, which precludes the use of tools. In this paper, we present EReinit, an implementation of the global‐restart model that addresses these problems. Our key idea and optimization is the co‐design of basic fault‐tolerance mechanisms, such as failure detection, notification, and recovery, between MPI and the resource manager, in contrast to current approaches in which these mechanisms are implemented in MPI only. We demonstrate EReinit in three HPC programs and show that it is up to four times more efficient than existing solutions at 4,096 processes.

Journal: Concurrency and Computation: Practice and Experience
Publication Date: Aug 14, 2018
Citations: 20

Similar Papers

Profile-based power shifting in interconnection networks with on/off links
Shinobu Miwa ... Hiroshi Nakamura
15 Nov 2015

Fail-stop Failure Recovery in Neighbor Replica Environment
Ahmad Shukri Mohd Noor ... Mustafa Mat Deris
Procedia Computer Science | VOL. 19
01 Jan 2013

Characteristic Analysis of Applications for Designing a Future HPC System
Osamu Watanabe ... Akihiro Musa
03 Nov 2014

Design and Implementation of High Availability OSPF Router
...
Journal of Information Science and Engineering | VOL. 26
01 Nov 2010
