A Checkpointing/Recovery System for MPI Applications on Cluster of IA-64 Computers

Youhui Zhang, Ruini Xue, Dongsheng Wong, Weimin Zheng

doi:10.1109/icppw.2005.5

Abstract

As clusters continue to grow in size and popularity, fault tolerance and reliability become limiting factors for application scalability and system availability. To address these issues, we design and implement ChaRM64 for MPI, a high-availability parallel run-time system that provides checkpoint-based rollback recovery and migration for MPI programs on a cluster of IA-64 computers. Our approach integrates MPICH with a user-level, single-process checkpoint/recovery library for IA-64 Linux, and modifies the P4 libraries to implement a coordinated checkpointing and rollback recovery (CRR) and migration mechanism for parallel applications. In addition, CRR of file operations is supported. Testing shows that the CRR mechanism in our implementation introduces negligible performance overhead.

Full Text