Abstract

A long-term trend in high performance computing is the increasing number of nodes in parallel computing platforms, which entails a higher failure probability. Hence, fault tolerance becomes a key property for parallel application running on parallel computing systems. The message passing interface (MPI) is currently the programming paradigm and communication library most commonly used on parallel computing platforms. MPI applications may be stopped at any time during their execution due to an unpredictable failure. In order to avoid complete restarts of an MPI application because of only one failure, a fault tolerant MPI implementation is essential. In this paper, we propose a fault tolerant approach in cluster computing system. Our approach is based on reassignment of tasks to the remaining system and message logging is used for message losses. This system consists of two main parts, failure diagnosis and failure recovery. Failure diagnosis is the detection of a failure and failure recovery is the action needed to take over the workload of a failed component. This fault tolerant approach is implemented as an extension of the message passing interface.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call