Abstract
Ensuring proper quality of service (QoS) is essential for cloud service providers and customers alike. To this end, cloud systems must rely as much as possible on automated and efficient methods of monitoring, introspection, and recovery. In particular, automated recovery is essential for long-term reliability and availability because human intervention is too slow and not every situation can be anticipated. In turn, automated recovery requires both efficient monitoring and accurate identification of root causes, so that the same causes do not lead to failures in the future. Current cloud systems use an in-memory time-series database for dynamic analysis or aggregation. When root cause analysis is done at all, it serves the convenience of reporting and does not need to be very accurate. As a result, recent studies lack details on how to accurately find root causes from time-series monitoring data. This study proposes a novel event-driven monitoring rule inference method based on dynamic case-based reasoning and shape-based root cause analysis. It is designed for autonomous recovery so as to guarantee the long-term QoS of cloud systems. The accuracy and performance of the approach are evaluated using realistic monitoring data that draws on more than a decade of experience at a major cloud service provider (Yahoo). The results show that our approach makes effective use of monitoring data to improve overall QoS and opens interesting directions for future work.