MMPI: A Scalable Fault Tolerance Mechanism for MPI Large Scale Parallel Computing

Zhiyuan Wang, Xuejun Yang, Yun Zhou

doi:10.1109/cit.2010.226

Abstract

At present, Checkpoint/Restart is one of the most popular fault tolerance mechanisms for large scale parallel computing. However, when system performance lies between peta (10^{15}) and exa (10^{18}) flops, the time to save a global checkpoint reaches and even exceeds the mean-time-between-failures (MTBF) of the component, which limits the scalability of parallel computing. In this paper, we design a scalable fault tolerance mechanism for MPI-oriented large scale parallel computing. It handles not only the fail-stop faults addressed by Checkpoint/Restart but also most data errors that go undetected by hardware. Firstly, we define the concept of a redundant-process cluster (RPC), design running techniques that support MMPI, and study the implementation of MMPI. Secondly, we present models of fault-tolerant parallel speedup. Lastly, we verify the validity and scalability of the MMPI fault tolerance mechanism.

Full Text