Current high-end parallel systems consist of hundreds of thousands of compute cores arranged in a complex hierarchical structure; future systems will have millions of cores. Systems such as the Altix 4700, Blue Gene, Roadrunner, and Cray XT5 combine multiple compute cores (homogeneous or heterogeneous) with several levels of shared and private caches within a processor, cluster these processors into SMP nodes, and couple the nodes via a communication network into large-scale distributed systems. Developing efficient programs for such machines is extremely complex because the architectural details are exposed to the programmer. Their productive use requires highly scalable programming tools for debugging, performance analysis, and fault tolerance; in addition, new programming models might significantly ease the programmer's task.

This special issue of Concurrency and Computation: Practice and Experience is devoted to programming tools that facilitate the development of efficient programs for such large-scale architectures. It collects the best papers submitted to the International Workshop on Scalable Tools for High-End Computing (STHEC 2008), held in conjunction with the International Conference on Supercomputing on June 7, 2008, on the Greek island of Kos.

The papers present state-of-the-art tools for performance analysis and checkpointing on these machines. Performance analysis tools use measurements gathered during the execution of an application to detect portions of the code that can be further improved; they must therefore cope with the large number of processors on which the application runs. Checkpointing tools make it possible to restart an application after a system failure and likewise have to handle a large number of cores.

The selected papers present different techniques for building tools that scale to thousands of cores. HPCToolkit [1] is a profiling-based performance analysis environment that presents its data in close relation to the source code without requiring source-code instrumentation. Scalasca [2] performs a parallel replay of the execution on the application's own processors to find performance bottlenecks automatically. The combination of TAU and MRNet [3] provides a scalable infrastructure for offloading performance data; establishing the overlay network requires no additional support from the job manager or the application. Periscope [4] is based on a network of analysis agents that perform an online analysis of the application's performance behavior; when the application is started, additional processors can be allocated for the analysis agents to scale the analysis. CPPC [5] is a tool for portable checkpointing of message-passing applications. It consists of a runtime library and a compiler that assists the user by performing time-consuming tasks such as data-flow and communication analyses as well as code instrumentation.

We would like to thank the authors for their excellent contributions to this special issue. We hope that it inspires future research on tools that support programmers of high-end systems in the development of efficient programs.
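
To illustrate the measurement-based approach shared by the performance tools discussed above, the following minimal sketch intercepts MPI_Send through the standard MPI profiling interface (PMPI) and records how much time each process spends sending messages. The sketch is purely illustrative; it is not taken from HPCToolkit, Scalasca, TAU, or Periscope, and the counters and output format are assumptions made for this example.

/* Illustrative PMPI wrapper: measures time spent in MPI_Send per process.
 * Not taken from any tool in this issue; names and output are illustrative. */
#include <mpi.h>
#include <stdio.h>

static double total_send_time = 0.0;  /* accumulated time in MPI_Send */
static long   send_calls      = 0;    /* number of MPI_Send invocations */

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double start = PMPI_Wtime();
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm); /* real send */
    total_send_time += PMPI_Wtime() - start;
    send_calls++;
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %ld MPI_Send calls, %.3f s total\n",
            rank, send_calls, total_send_time);
    return PMPI_Finalize();
}

Linking (or preloading) such wrappers into an MPI application yields per-process measurements without modifying the application's source code; the tools in this issue gather far richer data and do so at much larger scale, but the underlying idea of measuring during execution is the same.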