Fault localization in online microservices is challenging because of the vast volume of monitoring data, the diversity of data types and events, and the complex interdependencies among services and components. Fault events propagate across services and can trigger a cascade of faults within a short period. In industry, fault localization is typically conducted manually by experienced personnel, an approach that is unreliable and lacks automation. Information barriers between different modules make it difficult to align quickly during urgent faults. This inefficiency hampers stability assurance, which aims to minimize fault detection and repair time. Although some practical methods aim to automate the process, their accuracy and efficiency remain less than satisfactory. The precision of fault localization results is of paramount importance because it underpins engineers’ trust in the diagnostic conclusions, which are derived from multiple perspectives and offer comprehensive insights. Therefore, a more reliable method is required to automatically identify the associative relationships among fault events and their propagation paths. To this end, a knowledge graph-enhanced root cause analysis (RCA) method, KGroot, is designed for efficient and effective diagnosis of recurring failures in complex microservices environments. As the first event-driven knowledge graph method, KGroot uses event knowledge and the correlations between events to reason about root causes. A Fault Event Knowledge Graph (FEKG) is built from historical data, an online graph is constructed in real time when a failure event occurs, and the similarity between each event knowledge graph and the online graph is computed using graph convolutional networks (GCNs) to pinpoint the fault type through a ranking strategy. Comprehensive experiments demonstrate that KGroot locates the root cause among the top 3 potential causes with an accuracy of 93.5%, within seconds.
This performance meets the level required for real-time fault diagnosis in industrial environments and significantly surpasses state-of-the-art RCA baselines in both effectiveness and efficiency. KGroot is available at https://github.com/daixixiwang/KGroot.