Abstract

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed in detail because they are the most widely used approach for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.

Highlights

  • High Performance Computing (HPC) systems continue to grow exponentially in scale, currently from petascale computing (10^15 floating point operations per second) to exascale computing (10^18 floating point operations per second), as well as in complexity due to the growing need to handle long-running computational problems with effective techniques

  • The InfiniBand Architecture (IBA) may be the communication technology of the next generation of HPC systems; as of November 2011, InfiniBand-connected systems represented more than 42% of the systems in the Top500 list [33]

  • A large set of failure data was released by CFDR [10], comprising the failure statistics of 22 HPC systems with a total of 4,750 nodes and 24,101 processors, collected over a period of 9 years at Los Alamos National Laboratory (LANL)


Summary

Introduction

HPC systems continue to grow exponentially in scale, currently from petascale computing (10^15 floating point operations per second) to exascale computing (10^18 floating point operations per second), as well as in complexity due to the growing need to handle long-running computational problems with effective techniques. The total number of hardware components, the software complexity, and overall system reliability, availability and serviceability (RAS) are factors to contend with in HPC systems, because hardware or software failures may occur while long-running parallel applications are being executed. The need for reliable, fault-tolerant HPC systems has intensified because a failure may increase the execution time and cost of running the applications. Fault tolerance solutions are therefore being incorporated into HPC systems. Fault-tolerant systems are able to contain failures when they occur, thereby minimizing their impact. There is a need for further investigation of the fault tolerance of HPC systems.

Reliability and MTBF of HPC systems
Long-running applications and InfiniBand
Analysis of failure rates of HPC systems
Software failure rate
Hardware failure rate
Human caused failure rate
State of the art of fault tolerance techniques
Migration method
Redundancy
Failure masking
Failure semantics
Recovery
Rollback-recovery feature requirements for HPC systems
Checkpoint-based rollback-recovery mechanisms
Log-based rollback-recovery mechanisms
Taxonomy of checkpoint implementation
Reducing the time for saving the checkpoint in persistent storage
Findings
Summary
