Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?

Abstract

Microservice architecture has been widely adopted by cloud applications. However, identifying the root cause of a failure in a microservice system remains a challenging and time-consuming task. In recent years, researchers have introduced various causal inference-based root cause analysis methods to assist engineers in identifying root causes. To better understand the current status of causal inference-based root cause analysis techniques for microservice systems, we conduct a comprehensive evaluation of nine causal discovery methods and twenty-one root cause analysis methods. Our evaluation examines both the effectiveness and the efficiency of these methods, as well as other factors that affect their performance. Our experimental results and analyses indicate that no method stands out in all situations: each method tends to fall short in effectiveness or efficiency, or to be sensitive to specific parameters. Notably, the performance of root cause analysis methods on synthetic datasets may not accurately reflect their performance in real systems. There is still substantial room for improvement. We also suggest possible future work based on our findings.

Similar Papers
  • Research Article
  • Cited by 2
  • 10.54097/1gw77589
A Unified Framework for Anomaly Detection and Root Cause Analysis in Microservice Systems
  • Jul 15, 2025
  • Computer Life
  • Oliver Meyer + 2 more

Modern software applications increasingly rely on microservice architectures for scalability, flexibility, and rapid deployment. However, this architectural paradigm introduces new complexities in monitoring system behavior, identifying anomalies, and determining their root causes across distributed services. Existing solutions often address anomaly detection and root cause analysis (RCA) in isolation, leading to fragmented insights and delayed resolution. This paper proposes a unified framework that integrates real-time anomaly detection with automated RCA using machine learning and graph-based dependency modeling. The framework continuously monitors telemetry data—including metrics, logs, and traces—and applies an ensemble of statistical and deep learning models for multivariate anomaly detection. Detected anomalies are then contextualized through a service dependency graph and analyzed using causal inference techniques to identify the most probable root causes. We evaluate the framework on both synthetic benchmarks and real-world microservice deployments. Experimental results show that it achieves high precision and recall in anomaly detection while significantly reducing RCA latency compared to baseline methods. By combining anomaly detection and RCA in a cohesive pipeline, the proposed framework enhances system observability and reduces mean time to recovery (MTTR), thus improving operational resilience in complex microservice environments.

  • Conference Article
  • Cited by 52
  • 10.1109/cloudintelligence52565.2021.00015
MicroDiag: Fine-grained Performance Diagnosis for Microservice Systems
  • May 1, 2021
  • Li Wu + 4 more

Microservice architecture has emerged as a popular pattern for developing large-scale applications for its benefits of flexibility, scalability, and agility. However, the large number of services and complex dependencies make it difficult and time-consuming to diagnose performance issues. We propose MicroDiag, an automated system to localize root causes of performance issues in microservice systems at a fine granularity, including not only locating the faulty component but also discovering detailed information about its abnormality. MicroDiag constructs a component dependency graph and performs causal inference on diverse anomaly symptoms to derive a metrics causality graph, which is used to infer root causes. Our experimental evaluation on a microservice benchmark running in a Kubernetes cluster shows that MicroDiag localizes root causes well, with 97% precision of the top 3 most likely root causes, outperforming state-of-the-art methods by at least 31.1%.
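
The abstract does not say how root causes are inferred from the metrics causality graph, so the following is only a minimal sketch of one simple heuristic, not MicroDiag's algorithm: walk the graph backward from an observed symptom through anomalous metrics and report those that have no anomalous cause of their own. The function name, the graph encoding (effect mapped to its direct causes), and all metric names are hypothetical.

```python
from collections import deque


def root_cause_candidates(causes, anomalous, symptom):
    """Return anomalous metrics reachable from `symptom` via cause edges
    that have no anomalous cause themselves (illustrative heuristic only).

    causes:    dict mapping a metric to the list of its direct causes
    anomalous: set of metrics currently flagged as anomalous
    symptom:   the anomalous metric where the investigation starts
    """
    seen = {symptom}
    queue = deque([symptom])
    candidates = []
    while queue:
        node = queue.popleft()
        # Only follow edges to causes that are themselves anomalous.
        parents = [p for p in causes.get(node, []) if p in anomalous]
        if not parents:
            candidates.append(node)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return candidates


# Hypothetical metrics causality graph: effect -> direct causes.
causes = {
    "frontend.latency": ["cart.latency", "catalog.latency"],
    "cart.latency": ["cart.cpu", "redis.latency"],
    "redis.latency": ["redis.memory"],
    "catalog.latency": ["catalog.cpu"],
}
anomalous = {"frontend.latency", "cart.latency", "redis.latency", "redis.memory"}

print(root_cause_candidates(causes, anomalous, "frontend.latency"))
# ['redis.memory']
```

Real systems typically rank candidates with a score rather than returning an unordered set, which is what a "top 3" precision figure presupposes; the traversal above only narrows the search space.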

  • Research Article
  • 10.71465/csb161
Multi-Layer Causal Graphs for Distributed System Performance: Modeling Cross-Service Dependencies in Microarchitectures
  • Dec 2, 2025
  • Computer Science Bulletin
  • Sichen Liu

Modern distributed systems based on microservice architectures face unprecedented challenges in performance management and fault diagnosis due to their inherent complexity and dynamic nature. This paper presents a comprehensive framework for modeling cross-service dependencies using multi-layer causal graphs, enabling more accurate performance analysis and root cause localization in microservice environments. We propose a hierarchical approach that captures both inter-service and intra-service causal relationships across multiple abstraction layers, including the infrastructure layer, metric layer, and invocation layer. Our methodology integrates causal inference techniques with graph-based representations to construct dynamic dependency models that adapt to the evolving nature of microservice systems. Through systematic analysis of service invocation patterns, performance metrics, and infrastructure telemetry, we demonstrate how multi-layer causal graphs can effectively identify performance bottlenecks and trace anomaly propagation paths. The experimental evaluation on benchmark microservice applications reveals that our approach achieves superior accuracy in root cause localization compared to traditional single-layer methods, with an average precision improvement of 23% and recall enhancement of 18%. Furthermore, the proposed framework exhibits excellent scalability, maintaining consistent performance even as system complexity increases with additional services and dependencies.

  • Conference Article
  • Cited by 15
  • 10.1145/3430984.3431027
Evaluation of Causal Inference Techniques for AIOps
  • Jan 2, 2021
  • Vijay Arya + 5 more

Inferring causality of events from log data is critical to IT operations teams, who continuously strive to identify probable root causes of events so that incident tickets are resolved quickly and downtimes and service interruptions are kept to a minimum. Although prior work has applied some specific causal inference techniques to proprietary log data, it fails to benchmark the performance of different techniques on a common system or dataset. In this work, we evaluate the performance of multiple state-of-the-art causal inference techniques using log data obtained from a publicly available benchmark microservice system. We model log data both as a time series of error counts and as a temporal event sequence and evaluate 3 families of Granger causal techniques: regression-based, independence-testing-based, and event models. Our preliminary results indicate that event models yield causal graphs that have high precision and recall in comparison to regression-based and independence-testing-based Granger methods.
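
To make the regression-based family concrete, here is a minimal sketch of a single-lag Granger test in plain Python. It is not the implementation evaluated in the paper: the lag of 1, the synthetic Gaussian series (standing in for error-count time series), and all function names are assumptions made for illustration. The idea is to compare a regression of y on its own past against one that also includes the past of x; a large F-statistic suggests that x helps predict y.

```python
import random


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def rss(rows, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve_linear(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, target))


def granger_f(x, y):
    """F-statistic for the hypothesis 'x Granger-causes y' with one lag."""
    target = y[1:]
    restricted = [[1.0, y[t - 1]] for t in range(1, len(y))]
    unrestricted = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    rss_r = rss(restricted, target)
    rss_u = rss(unrestricted, target)
    n = len(target)
    # One extra regressor in the unrestricted model; it has 3 parameters.
    return (rss_r - rss_u) / (rss_u / (n - 3))


# Synthetic data (assumption): y follows x with a delay of one step.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0] + [0.8 * x[t - 1] + random.gauss(0, 0.5) for t in range(1, 200)]

f_xy = granger_f(x, y)
f_yx = granger_f(y, x)
print(f"x -> y: F = {f_xy:.1f}")
print(f"y -> x: F = {f_yx:.1f}")
```

On this data the statistic for x -> y should be far larger than for y -> x. In practice the F-statistic is compared against a critical value of the F distribution, and multiple lags are usually tested.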

  • Conference Article
  • 10.1109/cscwd61410.2024.10580445
MicroMCM: Fine-grained Root Cause Localization for Microservice Systems Based on Multiple Causal Inference Methods
  • May 8, 2024
  • Hanqing Gao + 3 more

Microservice architecture has become a prevalent approach for developing large-scale applications due to its scalability, flexibility, and agility. However, the large-scale deployment and frequent updates of microservices pose challenges for operational personnel in diagnosing performance issues. To address this, we propose MicroMCM, a framework that enables fine-grained, automated, and real-time root cause localization. MicroMCM dynamically selects different causal inference (CI) methods based on diverse anomaly patterns to construct causal graphs and utilizes root cause inference techniques to identify the root cause metrics. We conduct experiments for both coarse-grained and fine-grained root cause localization to evaluate the performance of MicroMCM. The results demonstrate that MicroMCM outperforms baseline methods, exhibiting superior localization capabilities.
