Abstract

To maintain the required service quality of time-critical cloud applications, operators must continuously monitor their runtime status, detect potential performance anomalies, and diagnose the root causes of these anomalies effectively. However, existing performance diagnosis methods face challenges such as the need for high-quality labeled data, the low reusability and robustness of performance anomaly detection models, and the absence of real-time fine-grained root cause localization. These challenges make fixing performance issues quickly and developing effective adaptation decisions difficult. We provide a Fine-grained Robust Performance Diagnosis (FIRED) framework to tackle those challenges. The framework offers a metrics selection component to filter noise and improve detection efficiency, an anomaly detection component that assembles several well-selected base models with a deep neural network, and adopts weakly supervised learning considering fewer labels exist in reality. The framework also employs a real-time, fine-grained root cause localization component to locate dependent resource metrics of performance anomalies. Our experiments show that the framework can effectively reduce data noise and achieve the best accuracy and algorithm robustness for performance anomaly detection. In addition, the framework can accurately localize the first root causes, with an average accuracy higher than 0.7 for locating the first four root cause metrics.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call