Optimizing Failure Prediction to Maximize Availability

Igor Kaitovic,Miroslaw Malek

doi:10.1109/icac.2016.48

Abstract

Availability of autonomous systems can be enhanced with self-monitoring and fault-tolerance methods based on failures prediction. With each correct prediction, proactive actions may be taken to prevent or to mitigate a failure. On the other hand, incorrect predictions will introduce additional downtime associated with the overhead of a proactive action that may decrease availability. The total effect on availability will depend on the quality of prediction (measured with precision and recall), the overhead of proactive actions (penalty), and the benefit of proactive actions when prediction is correct (reward). In this paper, we quantify the impact of failure prediction and proactive actions on steady-state availability. Furthermore, we provide guidelines for optimizing failure prediction to maximize availability by selecting a proper precision and recall trade-off with respect to penalty and reward. A case study to demonstrate the approach is also presented.

Full Text