Abstract
To maintain the required service quality of time-critical cloud applications, operators must continuously monitor their runtime status, detect potential performance anomalies, and diagnose the root causes of these anomalies effectively. However, existing performance diagnosis methods face challenges such as the need for high-quality labeled data, the low reusability and robustness of performance anomaly detection models, and the absence of real-time fine-grained root cause localization. These challenges make fixing performance issues quickly and developing effective adaptation decisions difficult. We provide a Fine-grained Robust Performance Diagnosis (FIRED) framework to tackle those challenges. The framework offers a metrics selection component to filter noise and improve detection efficiency, an anomaly detection component that assembles several well-selected base models with a deep neural network, and adopts weakly supervised learning considering fewer labels exist in reality. The framework also employs a real-time, fine-grained root cause localization component to locate dependent resource metrics of performance anomalies. Our experiments show that the framework can effectively reduce data noise and achieve the best accuracy and algorithm robustness for performance anomaly detection. In addition, the framework can accurately localize the first root causes, with an average accuracy higher than 0.7 for locating the first four root cause metrics.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.