In the vast majority of cases, remediation of IT issues encoded into domain-specific or user-defined alerts in cloud environments and customer ecosystems suffers from a lack of accurate recommendations that could be supplied in a timely manner to recover from performance degradations. Such recommendations are hard to realize by furnishing the abnormality definitions with appropriate expert knowledge, which varies from one environment to another. At the same time, in many support cases, the problems reported under Global Support Services (GSS) or Site Reliability Engineering (SRE) treatment ultimately fall to the product teams, who spend costly development hours investigating the self-monitoring metrics of our solutions. The lack of a systematic approach to adopting AIOps therefore significantly impacts the mean time to resolution (MTTR) of problems and alerts. Adopting such an approach implies building, maintaining, and continuously improving and annotating a data store of insights on which ML models are trained and generalized across the whole customer base and corporate cloud services. Our ongoing study aligns with this vision and validates an approach that learns alert resolution patterns in such a global setting and explains them using interpretable AI methodologies. The resulting knowledge store of causative rules is then applied to predict potential sources of the application degradation reflected in an active alert instance. In this communication, we share our experiences with a prototype solution and up-to-date analysis demonstrating how root conditions are discovered accurately for a specific type of problem, validated against historical data of resolutions that required heavy manual development effort. We also offer experts a Dempster–Shafer theory-based rule verification framework as a what-if analysis tool for testing their hypotheses about the underlying environment.
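To make the Dempster–Shafer element more concrete, the following is a minimal illustrative sketch (not the paper's implementation) of Dempster's rule of combination, the standard operation such a rule verification framework would rely on. The `combine` function, the alert hypotheses `"db"` and `"net"`, and the example mass assignments are all hypothetical and chosen only for illustration.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    m1, m2: dicts mapping frozenset hypotheses -> mass (non-negative, summing to 1).
    Returns the normalized combined mass function.
    """
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on the empty set (conflicting evidence)
    if conflict >= 1.0:
        raise ValueError("Totally conflicting evidence; Dempster's rule is undefined.")
    norm = 1.0 - conflict
    return {h: w / norm for h, w in combined.items()}

# Hypothetical what-if scenario: two causative rules assign belief to candidate
# root causes of an alert ("db" = database saturation, "net" = network latency).
rule_a = {frozenset({"db"}): 0.6, frozenset({"db", "net"}): 0.4}
rule_b = {frozenset({"net"}): 0.3, frozenset({"db", "net"}): 0.7}
print(combine(rule_a, rule_b))
# -> {frozenset({'db'}): ~0.512, frozenset({'net'}): ~0.146, frozenset({'db', 'net'}): ~0.341}
```

In a what-if analysis, an expert could perturb the mass assignments of a rule and observe how the combined belief over root-cause hypotheses shifts.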