Abstract

With the growing demand for data computation and communication, communication networks have grown significantly in size and complexity. In a large-scale communication network (e.g., a telecommunication network), hardware and software problems generate a massive volume of alarm events every day; a serious failure can trigger millions of alarms, each carrying crucial information such as the time, content, and device of the exception. As the network expands, the number of components grows and their interactions become more complex, which leads to numerous alarm events and complex alarm propagation. Moreover, many of these alarm events are redundant, and resolving them consumes considerable effort. To reduce alarms and pinpoint their root causes, we propose a data-driven, unsupervised alarm analysis framework that effectively compresses massive alarm events and improves the efficiency of root cause localization. In our framework, an offline learning procedure derives association reduction results from a period of historical alarms. An online analysis procedure then matches and compresses real-time alarms and generates root cause groups. We evaluate the framework on real communication network alarms from telecom operators; the results show that our method associates and reduces communication network alarms with an accuracy of more than 91%, removing more than 62% of redundant alarms. In addition, we validate it on fault data from a microservices system, where it achieves an accuracy of 95% in root cause localization. Compared with existing methods, the proposed method is more suitable for operation and maintenance analysis in communication networks.
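
The offline/online split described above can be illustrated with a minimal sketch. This is not the framework's actual algorithm, which the abstract does not specify; it assumes, purely for illustration, that associations are learned as simple co-occurrence rules over fixed time windows. The names `windows`, `learn_rules`, `compress`, `width`, and `min_conf`, and the alarm types in the usage note, are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def windows(alarms, width):
    """Bucket (timestamp, alarm_type) pairs into fixed-width time windows."""
    buckets = {}
    for ts, kind in alarms:
        buckets.setdefault(ts // width, set()).add(kind)
    return list(buckets.values())


def learn_rules(history, width=60, min_conf=0.8):
    """Offline step (assumed): keep a rule a -> b when b appears in at
    least min_conf of the historical windows that contain a."""
    single, pair = Counter(), Counter()
    for w in windows(history, width):
        single.update(w)
        for a, b in combinations(sorted(w), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    return {(a, b) for (a, b), n in pair.items() if n / single[a] >= min_conf}


def compress(live, rules, width=60):
    """Online step (assumed): in each window, drop alarms implied by another
    alarm in the same window; what remains is a root cause candidate group.
    Mutually implied alarms are both kept, since neither can be ranked."""
    groups = []
    for w in windows(live, width):
        derived = {
            b for a in w for b in w
            if (a, b) in rules and (b, a) not in rules
        }
        groups.append(sorted(w - derived))
    return groups
```

For example, if `link_down` is always accompanied by `port_unreachable` in the history, but `port_unreachable` also occurs on its own, `learn_rules` keeps only the rule from `link_down` to `port_unreachable`, and `compress` reduces a live window containing both alarms to `link_down` alone. A real system would also need topology information and more robust statistics than a single confidence threshold.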

