Abstract

As enterprises continue to move their workloads from traditional server-room environments to private cloud-based systems, companies like IBM are increasingly willing and able to centrally monitor these systems on behalf of their customers and proactively mitigate potential failure scenarios. In this paper, we investigate failures caused by software aging in an enterprise-class cloud controller system. We describe a service that continuously analyzes key system and application metrics from customer systems, identifies aging-related failures likely to occur within the next two days, and generates a list of tasks for the development-operations team at IBM to mitigate them. To help the team prioritize, we propose a scheme that assigns a severity to each task. From our analysis of two months of offline data, we find that the generated tasks have a precision of around 0.80 and a recall of 1: our approach did not miss any aging-related failure event, and around 80% of the events it flagged were true failures.
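To make the reported metrics concrete, the following minimal sketch shows how precision and recall are computed from counts of true positives, false positives, and false negatives. The function name and the counts are hypothetical, chosen only so that the ratios match the figures quoted above; they are not the study's actual data.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (precision, recall) from raw prediction counts.

    tp: flagged tasks that corresponded to a real aging-related failure
    fp: flagged tasks that did not correspond to a real failure
    fn: real failures for which no task was generated
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts (an assumption, not taken from the paper):
# 10 tasks generated, 8 of them true failures, no failure missed.
precision, recall = precision_recall(tp=8, fp=2, fn=0)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A recall of 1 means the false-negative count is zero, while a precision of 0.80 means that roughly one in five generated tasks was a false alarm.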
