Checkpoint Placement Research Articles

Non-volatile devices such as SSDs will be an integral part of the deepening storage hierarchy on large-scale HPC systems. These devices can reside on the compute nodes as part of a distributed burst buffer service, or they can be external. Wherever they sit in the hierarchy, one critical design issue is SSD endurance under write-heavy workloads such as checkpoint I/O for scientific applications. For these environments, it is widely assumed that checkpoints occur once every 60 min and that each checkpoint step can write out as much as half of the system memory. Unfortunately, for large-scale HPC applications, the burst buffer SSDs can wear out much more quickly given the large amount of data written at every checkpoint step. One possible solution is to limit the amount of data written by reducing the checkpoint frequency. However, a direct effect of reduced checkpoint frequency is a longer window of vulnerability to system failures and therefore potentially wasted computation time, especially for large-scale compute jobs. In this paper, we propose a new checkpoint placement optimization model that uses the burst buffer and the parallel file system together to store checkpoints, with the design goals of maximizing computation efficiency while guaranteeing the SSD endurance requirements. Moreover, we present an adaptive algorithm that dynamically adjusts the checkpoint placement based on the system's runtime characteristics and continuously optimizes burst buffer utilization. The evaluation results show that our adaptive checkpoint placement algorithm can guarantee burst buffer endurance with at most 5% performance degradation per application and less than 3% for the entire system.
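The endurance pressure described in this abstract can be made concrete with back-of-envelope arithmetic. The sketch below is not the paper's optimization model. It only estimates daily checkpoint write volume from the 60 min interval and half-of-memory figures quoted in the abstract, and then the fraction of that volume a burst buffer could absorb under a drive-writes-per-day (DWPD) budget. The function names, the memory size, the burst buffer capacity, and the DWPD rating are all hypothetical values chosen for illustration.

```python
def daily_checkpoint_writes_tb(memory_tb, fraction_written, interval_min):
    """Data written by checkpoints per day, in TB."""
    checkpoints_per_day = 24 * 60 / interval_min
    return memory_tb * fraction_written * checkpoints_per_day


def burst_buffer_share(daily_writes_tb, capacity_tb, dwpd):
    """Fraction of daily checkpoint data the burst buffer can absorb
    without exceeding its endurance budget (capacity * DWPD per day).
    The remainder would have to go to the parallel file system."""
    if daily_writes_tb <= 0:
        return 1.0
    budget_tb = capacity_tb * dwpd
    return min(1.0, budget_tb / daily_writes_tb)


# Hypothetical system: 100 TB of memory, half written every 60 min,
# 200 TB of burst buffer SSDs rated at 3 drive writes per day.
writes = daily_checkpoint_writes_tb(memory_tb=100, fraction_written=0.5, interval_min=60)
share = burst_buffer_share(writes, capacity_tb=200, dwpd=3)
print(f"daily checkpoint writes: {writes:.0f} TB, burst buffer share: {share:.0%}")
```

Under these assumed numbers the burst buffer can take only part of the checkpoint traffic, which is the situation that motivates splitting checkpoints between the burst buffer and the parallel file system.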


Checkpoint placement is an effective fault tolerance technique against transient faults: when a fault is detected, the task rolls back to the latest checkpoint and is re-executed from there. In this paper, we propose a new checkpoint placement strategy that separates the data saving and fault detection processes, which are performed together in conventional checkpoints. Performing several fault detection processes within one checkpoint interval decreases the latency between the occurrence and the detection of a fault. We address how to place the fault detection processes so as to maximize the probability that a task executes successfully within its given deadline. We develop a Markov chain model for a real-time task with the proposed checkpoints, run simulations that compute the probability of successful execution, and derive the optimal fault detection and checkpoint intervals.
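The effect of separating fault detection from data saving can be illustrated with a small Monte Carlo simulation. This is a minimal sketch under simplifying assumptions, not the Markov chain model from the paper: faults are assumed to arrive as a Poisson process, each checkpoint interval is assumed to be split into equal segments with a detection step at the end of each, and a detected fault is assumed to roll execution back to the start of the current interval. The function name and all parameters are hypothetical.

```python
import math
import random


def success_probability(work, deadline, n_ckpt, k_detect, fault_rate,
                        save_cost, detect_cost, trials=20000, seed=0):
    """Estimate the probability that a task finishes by its deadline.

    work        -- fault-free computation time of the task
    n_ckpt      -- number of checkpoint intervals
    k_detect    -- fault detection steps per checkpoint interval
    fault_rate  -- assumed Poisson rate of transient faults
    """
    rng = random.Random(seed)
    seg = work / (n_ckpt * k_detect)            # work between two detections
    p_fault = 1.0 - math.exp(-fault_rate * seg)  # fault within one segment
    successes = 0
    for _trial in range(trials):
        elapsed = 0.0
        done = 0
        while done < n_ckpt and elapsed <= deadline:
            clean = True
            for _step in range(k_detect):
                elapsed += seg + detect_cost
                if rng.random() < p_fault:
                    clean = False  # fault detected: redo this interval
                    break
            if clean:
                elapsed += save_cost  # save state only after a clean interval
                done += 1
        if done == n_ckpt and elapsed <= deadline:
            successes += 1
    return successes / trials


# Compare different numbers of detection steps per interval (hypothetical values).
for k in (1, 2, 4):
    p = success_probability(work=100, deadline=130, n_ckpt=4, k_detect=k,
                            fault_rate=0.005, save_cost=1.0, detect_cost=0.2)
    print(k, round(p, 3))
```

More detection steps shorten the work lost per fault but add detection overhead, so sweeping `k_detect` and `n_ckpt` in a sketch like this shows the kind of trade-off the abstract describes.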


Articles published on Checkpoint Placement

Fault tolerance based load balancing approach for web resources

Optimum checkpoints for programs with loops

Reliability Hardening Mechanisms in Cyber-Physical Digital-Microfluidic Biochips

Optimizing checkpoint data placement with guaranteed burst buffer endurance in large-scale hierarchical storage systems

Optimal Checkpoint Interval Selection for Real-Time Tasks Using Distributed Fault Detection

Online Checkpointing with Improved Worst-Case Guarantees

PROBLEMS OF QUALITY CONTROL DURING TRANSPORTATION OF PERISHABLE GOODS

Static Analysis for the Placement of Application-Level Checkpoints on Heterogeneous System

Probabilistic optimisation of checkpoint intervals for real-time multi-tasks

Aperiodic Checkpoint Placement Algorithms—Survey and Comparison

Optimal Checkpoint Placement on Real-Time Tasks with Harmonic Periods

Optimal Checkpoint Placement for Real-Time Systems with Multiple Tasks Whose Deadlines Are Longer Than Their Periods

Issues Involving the Placement of Watchmen Inside Early Modern Checkpoints : Particularly Concerning the Hakone Checkpoint

Software Assistants for Randomized Patrol Planning for the LAX Airport Police and the Federal Air Marshal Service

Complexity of Computing Optimal Stackelberg Strategies in Security Resource Allocation Games

Design Optimization of Time- and Cost-Constrained Fault-Tolerant Embedded Systems With Checkpointing and Replication

Numerical computation algorithms for sequential checkpoint placement

A DP-BASED CHECKPOINTING SCHEME IN REAL-TIME APPLICATIONS

Distribution-Free Checkpoint Placement Algorithms Based on Min-Max Principle

The interplay of power management and fault recovery in real-time systems
