A Scalable, Non-Parametric Method for Detecting Performance Anomaly in Large Scale Computing

Li Yu, Zhiling Lan

doi:10.1109/tpds.2015.2475741

Abstract

As computer systems continue to grow in scale and complexity, performance problems have become a common and major concern for large-scale computing. Performance anomalies caused by application bugs, hardware or software faults, or resource contention can have a great impact on system-wide performance and could lead to significant economic losses for service providers. While many detection methods have been presented in the past, the newly emerging challenges are detection scalability and practical use. In this paper, we propose a scalable, non-parametric method for effectively detecting performance anomalies in large-scale systems. The design is generic for anomaly detection in a variety of parallel and distributed systems that exhibit the peer-comparable property. It adopts a divide-and-conquer approach to address the scalability challenge and explores the use of non-parametric clustering and two-phase majority voting to improve detection flexibility and accuracy. We derive probabilistic models to quantitatively evaluate our decentralized design. Experiments with a suite of applications on production systems demonstrate that this method outperforms existing methods in terms of detection accuracy with negligible runtime overhead.

Full Text