Abstract

Resilience and fault tolerance are challenging problems in high performance computing (HPC) and extreme-scale systems. Components fail more often in such systems, which results in application aborts. Fault-tolerance techniques make it possible to detect failures consistently and to continue an application's execution even when failures occur. This paper uses the Message Passing Interface (MPI), a prominent parallel programming specification, to implement a failure detection and consensus algorithm. Although MPI does not itself provide fault-tolerant behavior, this work presents a fault-tolerant, matrix-based failure detection and consensus algorithm. The proposed algorithm uses gossiping. To detect failures, randomised pinging is applied during the execution of the algorithm using piggybacked gossip messages. To achieve consensus on the failures in the system, information about failed processes is sent to all alive processes using the same piggybacked gossip messages. The algorithm was implemented in MPI and is completely fault tolerant. The results show that all MPI process failures were detected using randomised pinging and that global consensus was achieved on the failed MPI processes in the system.
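
The idea of randomised pinging with piggybacked gossip can be illustrated with a small single-machine simulation. The sketch below is not the paper's MPI implementation: the function name simulate, the boolean suspicion matrix, the merge-by-OR rule, and the rule that a process suspects a peer only after its own ping goes unanswered are all assumptions made for illustration. Each alive process keeps a matrix whose entry [i][j] records that process i is known to suspect process j, pings one random peer per cycle, and piggybacks its matrix on the ping when the peer answers.

```python
import random


def simulate(n, failed, seed=0, max_cycles=1000):
    """Return the gossip cycle at which all alive processes agree, or None.

    Hypothetical model: `failed` is the ground-truth set of dead ranks,
    used only to decide which pings go unanswered and when to stop.
    """
    failed = set(failed)
    rng = random.Random(seed)
    alive = [p for p in range(n) if p not in failed]
    # view[p][i][j] is True when p knows that process i suspects process j.
    view = {p: [[False] * n for _ in range(n)] for p in alive}

    def agreed(p):
        m = view[p]
        suspected = {j for j in range(n) if m[p][j]}
        voters = [i for i in range(n) if i not in suspected]
        # Local consensus: every non-suspected process is known to
        # suspect the same set that p suspects.
        local = all(m[i][j] for i in voters for j in suspected)
        # Harness check against ground truth, so the simulation knows
        # when every real failure has been found.
        return local and suspected == failed

    for cycle in range(1, max_cycles + 1):
        for p in alive:
            q = rng.choice([x for x in range(n) if x != p])
            if q in failed:
                # The ping went unanswered: p records its own suspicion.
                view[p][p][q] = True
            else:
                # The ping carries p's matrix; q merges it element-wise.
                for i in range(n):
                    for j in range(n):
                        view[q][i][j] = view[q][i][j] or view[p][i][j]
        if all(agreed(p) for p in alive):
            return cycle
    return None


if __name__ == "__main__":
    print("consensus reached after cycle", simulate(8, {2, 5}))
```

In this model, failure detection and consensus share one message type: an answered ping is also the carrier for the suspicion matrix, so no separate broadcast is needed to spread information about failed processes.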

