Abstract

As high-performance clusters continue to grow in size and popularity, fault tolerance and reliability are becoming limiting factors for application scalability. We integrated a user-level checkpointing and rollback recovery (CRR) library into LAM/MPI, a high-performance implementation of the Message Passing Interface (MPI), to improve its availability. Compared with the existing CRR implementation in LAM/MPI, our work supports file checkpointing and offers greater portability: it runs on more platforms, including IA32 and IA64 Linux. In addition, our tests show that the CRR mechanism in our implementation introduces less than 15% performance overhead.

