Extensive experimentation with eACID

Shujaat Hussain, Muhammad Abdul Qadir

doi:10.1109/icet.2009.5353142

Abstract

Fault monitoring is one of the main activities of fault-tolerant distributed systems. It is required to determine which component is suspected or crashed and to proactively take recovery steps that keep the system alive. The main objective of fault monitoring is to identify faults quickly and correctly. A fault monitoring system that is quick to declare faults increases the chance of false alarms, i.e., declaring a fault where none has occurred. An ideal fault monitoring system therefore identifies faults as quickly as possible without increasing the number of false alarms. A fault monitor typically detects faults by exchanging messages with remote objects and observing the time interval between a message and its response. One of the major responsibilities of the monitor is to adapt these intervals to the dynamic network and system conditions and to set them very close to the actual delays in the system. The adapted values (the timeout and the monitoring interval) must not fluctuate with large amplitudes around the actual delays; otherwise, either the number of false alarms increases or the identification of faults is delayed. The adaptation should also converge to the actual delays very quickly. Adapting the monitoring interval in the same way as the timeout cannot be justified. Sometimes a distributed system (the network or other components) changes state abruptly for a very short duration (transient behavior). The fault monitoring system should ignore such transients; otherwise, decisions taken on a transient have to be reversed very quickly, which adds overhead both in taking the decision and in reverting it.
Our algorithm, eACID (enhanced Adaptive Convergent Intelligent fault monitoring in Distributed systems), when compared with the best-known algorithm, ADAPTATION [Sotama et al.], yielded 16% fewer false timeouts and 9% higher utilization of responses. eACID adapts the timeout based on the previous history of delays, which gives a fair idea of the workload, and we use this to our advantage. Our scheme does not take decisions based on transient behavior of the system.
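
The abstract does not give eACID's update rules, so the following is only a generic sketch of the two ideas it describes: adapting a timeout from the history of observed response delays, and ignoring short transients. It is not the eACID algorithm. The class name `AdaptiveTimeout`, the median-over-window transient filter, the exponentially weighted smoothing, and all parameter values (`window`, `alpha`, `beta`, `k`) are assumptions chosen for illustration.

```python
from collections import deque
from statistics import median


class AdaptiveTimeout:
    """Illustrative history-based timeout estimator (hypothetical, not eACID)."""

    def __init__(self, initial_timeout=1.0, window=5, alpha=0.125, beta=0.25, k=4.0):
        self.history = deque(maxlen=window)  # most recent response delays
        self.alpha = alpha        # gain for the delay estimate (assumed value)
        self.beta = beta          # gain for the deviation estimate (assumed value)
        self.k = k                # safety margin multiplier (assumed value)
        self.estimate = None      # smoothed delay estimate
        self.deviation = 0.0      # smoothed absolute deviation
        self.timeout = initial_timeout

    def observe(self, delay):
        """Record one measured response delay and return the adapted timeout."""
        self.history.append(delay)
        # The median of the window is unaffected by a single short spike,
        # so an isolated transient does not move the estimate.
        filtered = median(self.history)
        if self.estimate is None:
            self.estimate = filtered
            self.deviation = filtered / 2
        else:
            error = filtered - self.estimate
            self.estimate += self.alpha * error
            self.deviation += self.beta * (abs(error) - self.deviation)
        # Keep the timeout close to the estimated delay plus a margin.
        self.timeout = self.estimate + self.k * self.deviation
        return self.timeout

    def is_suspected(self, elapsed):
        """A component is suspected once its response is later than the timeout."""
        return elapsed > self.timeout


if __name__ == "__main__":
    monitor = AdaptiveTimeout()
    for d in [0.1] * 10 + [2.0] + [0.5] * 10:  # steady, one spike, then a real shift
        print(f"delay={d:.1f}s -> timeout={monitor.observe(d):.3f}s")
```

In this sketch a single delayed response leaves the timeout essentially unchanged, while a sustained increase in delay moves it upward over a few observations. A real monitor would also need a separate policy for the monitoring interval, which the abstract argues should not be adapted in the same way as the timeout.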

Full Text