Abstract

Fault tolerance has become an important issue in parallel computing. It is often addressed at the system level, but application-level approaches are receiving increasing attention. We consider a parallel programming pattern, the task pool, and provide a fault-tolerant implementation in a library. Specifically, our work refers to lifeline-based global load balancing, which is an advanced task pool variant that is implemented in the GLB framework of the parallel programming language X10. The variant considers side effect-free tasks whose results are combined into a final result by reduction. Our algorithm is able to recover from multiple fail-stop failures. If recovery is not possible, it halts with an error message. In the algorithm, each worker regularly saves its local task pool contents in the main memory of a backup partner. Backups are updated for steals. After failures, the backup partner takes over saved copies and collects others. In case of multiple failures, invocations of the restore protocol are nested. We have implemented the algorithm by extending the source code of the GLB library. In performance measurements on up to 256 places, we observed an overhead between 0.5% and 30%. The particular value depends on the application’s steal rate and task pool size. Sources of performance overhead have been further analyzed with a logging component.
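The backup idea described in the abstract can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions, not the GLB implementation: the names `Worker`, `Pool`, `process`, and `demo` are hypothetical, the backup partner is assumed to be the next live worker in a ring, a backup is assumed to hold a snapshot of the task list together with the partial result, and the reduction is assumed to be a sum. After a failure, all survivors simply write fresh backups, which stands in for the paper's restore protocol without reproducing it.

```python
def process(task):
    """Side-effect-free task: its result is combined by summation."""
    return task * task


class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.tasks = []      # local task pool
        self.result = 0      # partial reduction result
        self.alive = True
        self.held = {}       # backups kept for others: owner id -> (tasks, result)


class Pool:
    def __init__(self, n, tasks):
        self.workers = [Worker(i) for i in range(n)]
        self.workers[0].tasks = list(tasks)
        for w in self.workers:
            self.backup(w)

    def partner(self, w):
        """Assumed policy: the next live worker in the ring holds w's backup."""
        n = len(self.workers)
        for k in range(1, n):
            p = self.workers[(w.wid + k) % n]
            if p.alive:
                return p
        return None

    def backup(self, w):
        """Save a consistent snapshot (tasks and partial result) at the partner."""
        p = self.partner(w)
        if p is not None:
            p.held[w.wid] = (list(w.tasks), w.result)

    def step(self, w, k):
        """Process up to k tasks locally without writing a backup."""
        for _ in range(min(k, len(w.tasks))):
            w.result += process(w.tasks.pop())

    def steal(self, thief, victim):
        """Move half of the victim's tasks, then update both backups."""
        half = len(victim.tasks) // 2
        loot, victim.tasks = victim.tasks[:half], victim.tasks[half:]
        thief.tasks.extend(loot)
        self.backup(victim)
        self.backup(thief)

    def fail(self, w):
        """Fail-stop failure of w: its backup holder takes over the saved copy."""
        w.alive = False
        holder = next((p for p in self.workers
                       if p.alive and w.wid in p.held), None)
        if holder is None:
            raise RuntimeError("backup lost, cannot recover")
        tasks, result = holder.held.pop(w.wid)
        holder.tasks.extend(tasks)
        holder.result += result
        # Backups that w held are gone, so every survivor writes a fresh one.
        for s in self.workers:
            if s.alive:
                self.backup(s)

    def run(self):
        """Drain all live pools and reduce the partial results."""
        for w in self.workers:
            if w.alive:
                self.step(w, len(w.tasks))
        return sum(w.result for w in self.workers if w.alive)


def demo():
    pool = Pool(3, range(1, 11))
    w0, w1, _ = pool.workers
    pool.steal(w1, w0)   # backups are updated for the steal
    pool.step(w0, 2)     # progress made after the last backup
    pool.fail(w0)        # this progress is lost and recomputed by w1
    return pool.run()


if __name__ == "__main__":
    print(demo())
```

In this sketch the work done by the failed worker since its last backup is recomputed by the partner, which is safe only because tasks are side effect-free and the saved partial result matches the saved task list.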

Highlights

  • Fault tolerance has become an important issue in parallel computing

  • Our work refers to lifeline-based global load balancing, which is an advanced task pool variant that is implemented in the GLB framework of the parallel programming language X10

  • For Resilient X10, the MPI backend is based on the User-Level Failure Mitigation (ULFM) extension of MPI [6] [8]


Summary

Fohry et al DOI

We consider a particular variant called lifeline-based global load balancing, or the lifeline scheme for short. It is an advanced task pool variant with low communication costs and efficient termination detection. The lifeline scheme is used in the Global Load Balancing framework GLB [1], which is part of the standard library of the parallel programming language X10. It targets distributed-memory architectures in the Partitioned Global Address Space (PGAS) setting. The algorithm is correct in the sense that it either outputs the correct result or halts with an error message. Beyond that, it is robust, i.e., program aborts are rare.

Background
Lifeline Scheme
GLB Framework
Fault-Tolerant Algorithm
Algorithm in Failure-Free Operation
Restore after Single Failure
Nested Restore
Specific Timeouts
Variants of Algorithm
Correctness
Robustness and Efficiency
Experiments
Related Work
Conclusions
