Abstract

As clusters continue to grow in size and popularity, fault tolerance and reliability become limiting factors on application scalability and system availability. To address these issues, we design and implement ChaRM64 for MPI, a high-availability parallel run-time system that provides checkpoint-based rollback recovery and migration for MPI programs on a cluster of IA-64 computers. Our approach integrates MPICH with a user-level, single-process checkpoint/recovery library for IA-64 Linux and modifies the P4 libraries to implement a coordinated checkpointing and rollback recovery (CRR) and migration mechanism for parallel applications. CRR of file operations is also supported. Testing shows that the CRR mechanism in our implementation introduces negligible performance overhead.
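
The general idea of coordinated checkpointing and rollback recovery can be illustrated with a small, single-machine sketch. This is not ChaRM64's implementation: the `Process` class, its methods, and the two-phase structure below are hypothetical names chosen for illustration, the "processes" are plain Python objects rather than MPI ranks, and the sketch ignores in-flight messages, which a real coordinated protocol must also account for.

```python
import copy


class Process:
    """Hypothetical stand-in for one parallel process (not a real MPI rank)."""

    def __init__(self, rank):
        self.rank = rank
        self.state = {"step": 0, "value": 0}
        self.tentative = None   # checkpoint taken but not yet committed
        self.committed = None   # last globally consistent checkpoint

    def compute(self):
        # One unit of application work that mutates the process state.
        self.state["step"] += 1
        self.state["value"] += self.rank + 1

    def take_tentative(self):
        # Phase 1: save a private copy of the state and acknowledge.
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        # Phase 2: promote the tentative checkpoint to the committed one.
        self.committed = self.tentative
        self.tentative = None

    def rollback(self):
        # Recovery: restore the state from the last committed checkpoint.
        self.state = copy.deepcopy(self.committed)


def coordinated_checkpoint(procs):
    """Commit a checkpoint only if every process acknowledged phase 1."""
    acks = [p.take_tentative() for p in procs]
    if all(acks):
        for p in procs:
            p.commit()
        return True
    return False


def rollback_all(procs):
    """Roll every process back to the same committed checkpoint."""
    for p in procs:
        p.rollback()


def demo():
    procs = [Process(rank) for rank in range(3)]
    for p in procs:
        p.compute()                 # work done before the checkpoint
    coordinated_checkpoint(procs)
    for p in procs:
        p.compute()                 # work that is lost when a failure occurs
    rollback_all(procs)             # simulated failure: everyone rolls back
    return [p.state for p in procs]


if __name__ == "__main__":
    print(demo())
```

Because all processes commit together and roll back together, the restored states belong to the same checkpoint, which is the property that distinguishes a coordinated scheme from independent per-process checkpoints.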
