Using resource monitoring to select recovery strategies

R Tirtea, R Belmans, V De Florio, G Deconinck

doi:10.1109/rams.2004.1285459

Abstract

Distributed heterogeneous embedded systems that control infrastructures, such as the electric power infrastructure, need to ensure reliable services regardless of faults and changes in the environment. A fault tolerance middleware architecture containing mechanisms for adaptation of quality of service (QoS) is developed to assure dependable control of the components of the infrastructure. Recovery strategies allow reconfiguration of the system (e.g. graceful degradation) based on the circumstances of the failure. In this paper we present why and how available resources should also be considered, together with the type of failure and its circumstances, in the selection of a recovery strategy. Changes in the environment, such as reduced resources at the node level (e.g. system overload) or degradation of QoS (e.g. scarcity of bandwidth on communication links), should be considered before allocating a new process/task to another host or before taking reconfiguration decisions. A mathematical model for generating a composite indicator based on sampled parameters is presented. The mechanism for monitoring resources at the node level is described, and we show how it can be used in the selection of a recovery action (e.g. restarting processes on, or migrating processes to, overloaded nodes should be avoided). Also, based on the communication characteristics between distributed sites (e.g. depending on availability or on cost), different recovery strategies can be selected. For this paper we consider the case of two recovery strategies and we present a mechanism for selecting the appropriate one. The fault-tolerant architecture integrating the QoS monitoring mechanism achieves dynamic reconfiguration of the recovery strategies based on changes in the environment. The QoS monitoring mechanism also improves the differentiation between node crashes and network problems for nodes suspected of failure. Another advantage of using this mechanism is the dynamic adaptation of resource allocation for an overall increase in application availability.
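
As an illustration only, the idea of combining sampled parameters into a composite indicator and using it to choose between two recovery strategies might be sketched as follows. The abstract does not give the actual model, so everything here is an assumption: the sampled parameters (`cpu_load`, `mem_used`, `link_used`), the weighted-sum form, the weights, the overload threshold, and the strategy names `restart` and `migrate` are hypothetical placeholders, not the authors' formulation.

```python
from dataclasses import dataclass


@dataclass
class NodeSample:
    """Hypothetical resource sample for one node; each value is a fraction in [0, 1]."""
    cpu_load: float
    mem_used: float
    link_used: float


# Assumed weights and threshold; the paper's actual model is not given in the abstract.
WEIGHTS = {"cpu_load": 0.5, "mem_used": 0.3, "link_used": 0.2}
OVERLOAD_THRESHOLD = 0.8


def composite_indicator(sample: NodeSample) -> float:
    """Weighted sum of the sampled parameters (higher means fewer free resources)."""
    return sum(weight * getattr(sample, name) for name, weight in WEIGHTS.items())


def select_recovery(local: NodeSample, remote: NodeSample) -> str:
    """Choose between two assumed strategies while avoiding overloaded nodes.

    'restart' restarts the failed process on its current node;
    'migrate' moves it to the remote node.
    """
    local_score = composite_indicator(local)
    remote_score = composite_indicator(remote)
    if local_score < OVERLOAD_THRESHOLD:
        return "restart"
    if remote_score < OVERLOAD_THRESHOLD:
        return "migrate"
    # Both nodes are overloaded: fall back to the less loaded one.
    return "restart" if local_score <= remote_score else "migrate"


if __name__ == "__main__":
    overloaded = NodeSample(cpu_load=0.95, mem_used=0.9, link_used=0.9)
    idle = NodeSample(cpu_load=0.2, mem_used=0.3, link_used=0.1)
    print(select_recovery(local=overloaded, remote=idle))
```

In this sketch a local restart is preferred whenever the local node is not overloaded, on the assumption that migration is the more expensive action; a cost- or availability-based rule for the communication link could replace that preference.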

Full Text