Checkpointing Protocol Research Articles

The main issues in supporting fault tolerance through checkpointing and rollback recovery for high-performance applications are the scalability of the support introduced, the ability to analyze the overhead it induces and, more generally, the optimization of the trade-off between failure-free and recovery performance. In this paper we describe our contribution to fault tolerance for high-level structured parallelism models. We take a different viewpoint from existing contributions by introducing a methodology for deriving properties that support fault tolerance. We show how to apply this methodology to a general data parallel model, deriving properties that are used to introduce a class of checkpointing protocols. Thanks to this methodology, this class of protocols is not affected by the issues described above. We present two example checkpointing protocols and the related rollback recovery techniques. For each protocol we also derive cost models that statically describe the failure-free performance; these can be used for performance tuning or to target a given Quality of Service parameter. To assess the novelty of the results, we compare the introduced protocols analytically and experimentally with two protocols from the literature. Results show that while the protocols introduced in this paper permit the definition of cost models and scale well, the literature protocols do not always have these properties. Copyright © 2010 John Wiley & Sons, Ltd.

An efficient and reliable fault tolerance protocol plays a key role in High Performance Computing. Rollback recovery is the most common fault tolerance technique used in High Performance Computing, especially in MPI applications. This technique relies on the reliability of the checkpoint storage. Most rollback recovery protocols assume that the checkpoint server machines are reliable. However, in a grid environment any unit can fail at any moment, including the components used to connect different administrative domains. Such failures lead to the loss of a whole set of machines, including the more reliable machines used to store the checkpoints in that administrative domain. Thus it is not safe to rely on the high Mean Time Between Failures of specific machines to store the checkpoint images. This paper introduces a new coordinated checkpoint protocol that tolerates checkpoint server failures and cluster failures, and ensures checkpoint storage reliability in a grid environment. To provide this reliability, the protocol is based on a replication process. We propose new hierarchical replication strategies that exploit the locality of checkpoint images in order to minimize inter-cluster communication. We evaluate the effectiveness of our two hierarchical replication strategies through simulations against several criteria, such as topology and scalability.
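The abstract does not specify the two replication strategies, so the following is only a minimal sketch of the general idea under stated assumptions: most replicas of a checkpoint image stay on servers in the image's own cluster (locality), and a single copy goes to another cluster so the image survives the loss of a whole cluster. The function names, the `local_copies` parameter, and the placement policy itself are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def place_replicas(
    home_cluster: str,
    servers_by_cluster: Dict[str, List[str]],
    local_copies: int = 2,
) -> List[str]:
    """Choose checkpoint servers for one image (hypothetical policy).

    Keeps up to `local_copies` replicas inside the home cluster and adds
    one replica in the first other cluster that has a server available.
    """
    if home_cluster not in servers_by_cluster:
        raise ValueError(f"unknown cluster: {home_cluster}")

    # Local replicas: cheap to write, tolerate single-server failures.
    placement = list(servers_by_cluster[home_cluster][:local_copies])

    # One remote replica: the only inter-cluster transfer, tolerates
    # the failure of the whole home cluster.
    for cluster in sorted(servers_by_cluster):
        if cluster != home_cluster and servers_by_cluster[cluster]:
            placement.append(servers_by_cluster[cluster][0])
            break
    return placement


def inter_cluster_transfers(
    home_cluster: str, placement: List[str], cluster_of: Dict[str, str]
) -> int:
    """Count replicas that must cross a cluster boundary."""
    return sum(1 for server in placement if cluster_of[server] != home_cluster)


def survives_cluster_failure(
    placement: List[str], cluster_of: Dict[str, str], failed_cluster: str
) -> bool:
    """True if at least one replica lives outside the failed cluster."""
    return any(cluster_of[server] != failed_cluster for server in placement)
```

With three servers in cluster A and one each in clusters B and C, this policy stores two copies in A and one in B, so only one transfer crosses a cluster boundary while the image still survives the loss of cluster A.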

Articles published on Checkpointing Protocol

A fully informed model-based checkpointing protocol for preventing useless checkpoints

Unified model for assessing checkpointing protocols at extreme‐scale

Towards an energy estimator for fault tolerance protocols

Achieving Checkpointing Global Consistency Through a Hybrid Compile Time and Runtime Protocol

Scalable Checkpointing-Based Rollback Recovery Protocol for Geographically Distributed Systems

System Progress Estimation in Time based Coordinated Checkpointing Protocols

A multi-cycle checkpointing protocol that ensures strict 1-rollback

An Efficient Coordinated Checkpointing Approach for Distributed Computing Systems with Reliable Channels

Message efficient global snapshot recording using a self stabilizing spanning tree in a MANET

Independent checkpointing in a heterogeneous grid environment

Theoretical and experimental evaluation of communication-induced checkpointing protocols in [formula omitted] and [formula omitted] families

Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems

Fault tolerance for data parallel programs

Design and Performance Analysis of Coordinated Checkpointing Algorithms for Distributed Mobile Systems

A Review of Fault Tolerant Checkpointing Protocols for Mobile Computing Systems

Anti-message Logging Based Coordinated Checkpointing Protocol for Deterministic Mobile Computing Systems

A Low-overhead Minimum Process Coordinated Checkpointing Algorithm for Mobile Distributed System

Real Time Snapshot Collection Algorithm for Mobile Distributed Systems with Minimum Number of Checkpoints

A weighted checkpointing protocol for mobile distributed systems

Hierarchical Replication Techniques to Ensure Checkpoint Storage Reliability in Grid Environment

