Dependable complex systems often operate under variable and non-stationary conditions, which requires efficient and extensive monitoring and error detection solutions. Among the many, the paper focuses on anomaly detection techniques, which monitor the evolution of some specific indicators through time to identify anomalies, i.e. deviations from the expected operational behavior. The timely identification of anomalies in dependable, fault tolerant systems allows to timely detect errors in the services and react appropriately. In this paper, we investigate the possibility to monitor the evolution of indicators through time using the random walk model on indicators belonging to Operating Systems, specifically in our study the Linux Red Hat EL5. The approach is based on the experimental evaluation of a large set of heterogeneous indicators, which are acquired under different operating conditions, both in terms of workload and faultload, on an air traffic management target system. The statistical analysis is based on a best-fitting approach aiming to minimize the integral distance between the empirical data distribution and some reference distributions. The outcomes of the analysis show that the idea of adopting a random walk model for the development of an anomaly detection monitor for critical systems that operates at Operating System level is promising. Moreover, standard distributions such as Laplace and Cauchy, rather than Normal, should be used for setting up the thresholds of the monitor. Further studies that involve a new application, a different Operating System and a new layer (an Application Server) will allow verifying the generalization of the approach to other fault tolerant systems, monitored layers and set of indicators.
Read full abstract