Monitoring and Mitigating Software Aging on IBM Cloud Controller System

Harish Sukhwani, Andy Rindos, Kishor S Trivedi, Rivalino Matias

doi:10.1109/issrew.2017.65

Abstract

As enterprises continue to move their workloads from traditional server-room environments to private cloud-based systems, companies like IBM have a growing desire and ability to monitor these systems centrally on behalf of their customers and to proactively mitigate potential failure scenarios. In this paper, we investigate failures caused by software aging in an enterprise-class cloud controller system. We describe a service that continuously analyzes key system and application metrics from customer systems, identifies aging-related failures likely to occur within the next two days, and generates a list of tasks for the development-operations team at IBM to mitigate those failures. To help the team prioritize, we propose a scheme that assigns a severity to each task. From our analysis of two months of offline data, we find that the generated tasks have a precision of around 0.80 and a recall of 1. That is, our approach did not miss any aging-related failure event, and around 80% of the generated tasks corresponded to true failure events.
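To make the reported metrics concrete, the following minimal sketch shows how precision and recall are computed from counts of generated tasks. The function name `precision_recall` and the counts used here (8 true positives, 2 false positives, 0 missed failures) are illustrative assumptions chosen only to reproduce values of 0.80 and 1; they are not figures from the paper.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for a set of generated tasks.

    true_pos:  tasks that corresponded to a real aging-related failure
    false_pos: tasks raised with no corresponding failure
    false_neg: failures for which no task was raised
    """
    flagged = true_pos + false_pos   # all tasks the service generated
    actual = true_pos + false_neg    # all failure events that occurred
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not taken from the paper).
precision, recall = precision_recall(true_pos=8, false_pos=2, false_neg=0)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

A recall of 1 requires zero missed failures, while a precision of 0.80 means roughly one in five generated tasks would be a false alarm for the team to triage.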

Full Text