Recovery in Distributed Systems from Transient and Permanent Faults

Z Aliouat, M Aliouat

doi:10.3844/jcssp.2007.617.623

Abstract

Recovery from transient faults in distributed systems has been studied intensively, but to the best of our knowledge, none of these studies addresses transient and permanent hard faults together. Our study is devoted to process recovery in a distributed environment in the presence of hard faults, whether transient or permanent. The recovery mechanism we present can be based on any one of six proposed strategies involving checkpointing and message logging between the processes of a distributed application. This exhaustive set of six strategies is system-dependent. The strategies were examined with respect to how recovery propagates across processes, in order to prevent the tedious, well-known domino effect problem. The framework considered was a distributed system composed of a set of autonomous nodes, each running its own local system, some of which were set aside to replace failing nodes in case of a permanent fault. Our main contribution was to enable a distributed application to meet its requirement of completing its mission in spite of a node crash. Preliminary experimental results from a fault-tolerant mechanism based on one of the proposed strategies suggest that our proposals are effective.

Full Text