The microservice architecture has become a mainstream architectural pattern for web service applications in recent years. Compared with traditional software architectures, however, it has a more complex deployment structure, which exposes it to more potential risks and a wider variety of fault symptoms. Microservice practitioners have started to use the word 'resilience' to describe the capability of coping with different unexpected conditions. How to judge whether a disruption of the system environment poses a risk to microservice resilience, and how to analyse resilience risks before the system is released, are open research questions in microservice development. Because the practice of chaos engineering already addresses resilience risk identification, this paper focuses on how to analyse the identified resilience risks in microservice architecture systems and proposes a resilience risk analysis method. Based on performance monitoring data collected during chaos experiments, the method uses a causality search algorithm to build causality graphs of performance indicators, and uses a causality inference algorithm to generate causality chains that are presented to system operators. The effectiveness of the proposed approach is demonstrated through a case study on a microservice architecture system.
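
To illustrate what generating causality chains from a causality graph can look like, the following is a minimal Python sketch. It is not the paper's algorithm, since the abstract names neither the causality search nor the causality inference algorithm. It assumes a directed causality graph over performance indicators has already been built, and it simply enumerates every acyclic chain that ends at a chosen indicator. The function name `causality_chains`, the graph `EXAMPLE_GRAPH`, and all indicator names are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical causality graph: an edge "a -> b" means indicator a is
# assumed to influence indicator b. All indicator names are made up.
EXAMPLE_GRAPH: Dict[str, List[str]] = {
    "injected_network_delay": ["db_query_latency"],
    "db_query_latency": ["order_service_latency"],
    "order_service_cpu": ["order_service_latency"],
    "order_service_latency": ["gateway_response_time"],
}


def causality_chains(graph: Dict[str, List[str]], target: str) -> List[List[str]]:
    """Return every acyclic chain of indicators ending at `target`.

    Each chain is ordered from the earliest cause to `target`.
    """
    # Invert the edges so we can walk from an effect back to its causes.
    parents: Dict[str, List[str]] = {}
    for cause, effects in graph.items():
        for effect in effects:
            parents.setdefault(effect, []).append(cause)

    chains: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        # Skip causes already on the path so that cycles terminate.
        causes = [p for p in parents.get(node, []) if p not in path]
        if not causes:
            # No further causes: emit the chain, unless it is the target alone.
            if len(path) > 1:
                chains.append(list(reversed(path)))
            return
        for cause in causes:
            walk(cause, path + [cause])

    walk(target, [target])
    return chains


if __name__ == "__main__":
    for chain in causality_chains(EXAMPLE_GRAPH, "gateway_response_time"):
        print(" -> ".join(chain))
```

In this sketch, an operator who observes degraded `gateway_response_time` during a chaos experiment would see two candidate chains, one starting at the injected network delay and one starting at CPU usage of the order service. Constructing the graph from monitoring data and ranking the chains are left out, because the abstract does not describe how either is done.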