Abstract

Checkpoint/Restart is currently one of the most popular fault tolerance mechanisms for large-scale parallel computing. However, once system performance lies between Peta (10^{15}) and Exa (10^{18}) flops, the time needed to save a global checkpoint approaches and can even exceed the mean time between failures (MTBF) of the components, which limits the scalability of parallel computing. In this paper, we design a scalable fault tolerance mechanism for MPI-oriented large-scale parallel computing. It handles not only the fail-stop faults addressed by Checkpoint/Restart but also most data errors that go undetected by hardware. First, we define the concept of a redundant-process cluster (RPC), design the running techniques that support MMPI, and study the implementation of MMPI. Second, we present models of fault-tolerant parallel speedup. Finally, we verify the validity and scalability of the MMPI fault tolerance mechanism.
