Automated Root Cause Analysis with Observability Data - A Comprehensive Review

Ankur Mahida, Site Reliability Engineer (SRE), Barclays, USA

doi:10.47363/jeast/2023(5)230

Abstract

Identifying the root cause of a failure is the key to resolving errors quickly in large, multi-functional software systems. As systems and networks grow more intricate and opaque, manual analysis by human operators can no longer keep pace. This paper discusses ways to automate root cause analysis using observability data such as logs, metrics, and traces. Techniques including causal inference, anomaly detection, and pattern recognition make it possible to isolate a fault against a background of thousands of connected events. Data-driven tools in research and industry apply these techniques: they consume observability data as input, uncover anomalies, model system topology, and rank probable root causes. The expected benefits are a shorter mean time to repair, higher reliability, and better security. Deeper adoption also benefits chain management and the productivity of individual developers. The review covers algorithmic strategies and implementation approaches for root cause analysis with observability data. Related topics such as service maps and anomalous values are discussed where necessary. Finally, scaling out automated diagnosis is examined as a replacement for manual diagnosis, which struggles when system models become elaborate.

Full Text