A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

Ifeanyi P Egwutuoha, David Levy, Bran Selic, Shiping Chen

doi:10.1007/s11227-013-0884-0

Open Access

https://doi.org/10.1007/s11227-013-0884-0

Journal: The Journal of Supercomputing
Publication Date: Feb 12, 2013
Citations: 253
License type: cc-by

Affiliation: University of Sydney

Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems and the issues with these approaches. Rollback-recovery techniques are discussed because they are the most widely used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.

Highlights

High Performance Computing (HPC) systems continue to grow exponentially in scale; currently from petascale computing (10^15 floating point operations per second) to exascale computing (10^18 floating point operations per second) as well as in complexity due to the growing need to handle long-running computational problems with effective techniques
The InfiniBand Architecture (IBA) may be the communication technology of the next generation of HPC systems; as of November 2011, InfiniBand connected systems represented more than 42% of the systems in the Top500 list [33]
A large set of failure data was released by CFDR [10], comprising the failure statistics of 22 HPC systems, including a total of 4,750 nodes and 24,101 processors collected over a period of 9 years at Los Alamos National Laboratory (LANL)

Summary

Introduction

HPC systems continue to grow exponentially in scale, currently from petascale computing (10^15 floating point operations per second) to exascale computing (10^18 floating point operations per second), as well as in complexity, due to the growing need to handle long-running computational problems with effective techniques. The total number of hardware components, the software complexity and overall system reliability, availability and serviceability (RAS) are factors to contend with in HPC systems, because hardware or software failure may occur while long-running parallel applications are being executed. The need for reliable fault tolerant HPC systems has intensified because a failure may increase the execution time and cost of running the applications. Fault tolerance solutions are therefore being incorporated into HPC systems. Fault tolerant systems have the ability to contain failures when they occur, thereby minimizing the impact of failure. There is a need for further investigation of fault tolerance of HPC systems.

Reliability and MTBF of HPC systems

Long-running applications and InfiniBand

Analysis of failure rates of HPC systems

Software failure rate

Hardware failure rate

Human caused failure rate

State of the art of fault tolerance techniques

Migration method

Redundancy

Failure masking

Failure semantics

Recovery

Rollback-recovery feature requirements for HPC systems

Checkpoint-based rollback-recovery mechanisms

Log-based rollback-recovery mechanisms

Taxonomy of checkpoint implementation

Reducing the time for saving the checkpoint in persistent storage

Findings

Summary


