Abstract
Fault Tolerant MPI (FTMPI) adds fault tolerance to MPICH, an open source implementation of the MPI standard developed by Argonne National Laboratory's Mathematics and Computer Science Division. FTMPI is a transparent fault-tolerant environment based on a synchronous checkpoint and restart mechanism. It relies on a single-process, non-multithreaded checkpointing library to checkpoint each application process synchronously. A replicated global system controller and a node controller on each cluster node monitor and control the checkpointing and recovery activities of all MPI applications within the cluster. This work details the architecture that provides the fault tolerance mechanism for MPI-based applications running on clusters, and reports the performance of the NAS Parallel Benchmarks and of the parallelized medium-range weather forecasting models P-T80 and P-T126. The architecture also addresses the following issues: replicating the system controller to avoid a single point of failure, ensuring the consistency of checkpoint files with a distributed two-phase commit protocol, and providing a robust fault detection hierarchy.
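As an illustration of how a two-phase commit can keep a set of checkpoint files consistent, the following is a minimal single-process sketch and not FTMPI's actual protocol. The names NodeCheckpoint and checkpoint_round, the file naming scheme, and the use of a temporary file plus an atomic rename are all assumptions made for this example. Each node first stages its checkpoint and votes; the coordinator replaces the previous checkpoints only if every node voted yes, so a failed round leaves the last consistent set intact.

```python
import os
import tempfile


class NodeCheckpoint:
    """Hypothetical per-node participant that stages and commits one checkpoint file."""

    def __init__(self, directory, rank):
        self.final_path = os.path.join(directory, f"ckpt_rank{rank}.img")
        self.temp_path = self.final_path + ".tmp"

    def prepare(self, data):
        """Phase 1: write the checkpoint to a temporary file and return a vote."""
        try:
            with open(self.temp_path, "wb") as f:
                f.write(data)
                f.flush()
                os.fsync(f.fileno())  # make sure the staged data reached the disk
            return True
        except OSError:
            return False

    def commit(self):
        """Phase 2 (commit): atomically replace the previous checkpoint."""
        os.replace(self.temp_path, self.final_path)

    def abort(self):
        """Phase 2 (abort): discard the staged file and keep the previous checkpoint."""
        if os.path.exists(self.temp_path):
            os.remove(self.temp_path)


def checkpoint_round(nodes, payloads):
    """Coordinator: commit on all nodes only if every node voted yes in phase 1."""
    votes = [node.prepare(data) for node, data in zip(nodes, payloads)]
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        nodes = [NodeCheckpoint(d, rank) for rank in range(4)]
        committed = checkpoint_round(nodes, [b"process image"] * 4)
        print("checkpoint committed:", committed)
```

In a real cluster the prepare and commit calls would be messages between the system controller and the node controllers, and the coordinator would also have to handle a node failing between the two phases; the sketch omits both.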