Abstract

Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many long-running scientific applications. The common approach is to save computation states in memory and secondary storage so that execution can be resumed later. However, as the GPU plays an ever larger role in high performance computing, no effective checkpoint/restart scheme exists for it yet, because GPU computation states are difficult to capture. This paper proposes an application-level checkpoint/restart scheme that saves and restores GPU computation states in annotated user programs. A pre-compiler and a run-time support module are developed to construct and save states in CPU system memory dynamically, while secondary storage can be utilized for scalability and long-term fault tolerance. CUDA programs with complicated computation states are supported: state-related variables dispersed across various memory units are collected, and both the stack and the heap are duplicated at the application level for state construction. Experimental results demonstrate the effectiveness of the proposed scheme.

Highlights

  • High Performance Computing (HPC) systems are usually used to solve more complex problems and many long-running HPC applications are more likely to encounter failures than regular applications

  • All Graphics Processing Unit (GPU) checkpoint/restart experiments are conducted based on NVIDIA Fermi GPU, Tesla C2050

  • In order to reduce the overhead of CheCUDA, Supada Laosookasathit proposed a lightweight checkpoint/restart using CUDA streams based on the Virtual Cluster Checkpointing Protocol (VCCP) [29]

Summary

Introduction

High Performance Computing (HPC) systems are usually used to solve complex problems, and many long-running HPC applications are more likely to encounter failures than regular applications. The NVIDIA GPU product families such as Tesla, Fermi and Kepler were designed from the ground up for parallel computing and offer distinctive high performance computing features [4]. Checkpoint/restart for traditional CPU computations can be accomplished at three levels, where computation states are acquired or constructed: the kernel, library and application levels [6]. Without operating system and system call/library support on the GPU, only an application-level checkpoint/restart approach can be adopted. This paper proposes a checkpoint/restart scheme that consists of a pre-compiler and a run-time support module. A new infrastructure is developed so that the pre-compiler transforms both CPU and GPU source code and inserts library calls for the run-time support module to construct computation states dynamically.

CUDA Programming
GPU Memory Hierarchy
Host-side Code Transformation
Device-side Code Transformation
Run-time Support Module
Buffer Allocation
State Registration
State Saving
State Restoration
File Management
Data Structure Deletion
Experimental Results
Related Work
Conclusions and Future Work
