Abstract

The complexity of modern software makes it impossible to detect all faults before deployment, and residual faults can ultimately lead to failures during operation. Online Failure Prediction (OFP) is a technique that aims to mitigate the effects of residual faults by predicting incoming failures. Taking advantage of technological developments, recent studies have shown that, while challenging, it is possible to use Machine Learning (ML) to build accurate failure predictors for modern complex systems. However, predictive performance alone is not enough for OFP to be accepted as a viable alternative: it is also necessary to assess how sensitive the predictors are to the training data and how they perform in different contexts, such as variations in the workload, both in terms of avoiding false alerts and of predicting failures. This practical experience report presents a detailed analysis of the performance of predictors under varying scenarios to answer a key question: can OFP be used in practice? Results suggest that ML-based predictors can tolerate such variations without raising false alerts while maintaining predictive accuracy, supporting the idea that OFP can be used in practice and warrants further research.
