Improving the system log analysis with language model and semi-supervised classifier

Guofu Li, Mei Wu, Zhiyi Chen, Ning Cao, Guangsheng Cao, Hongjun Li, Chenjing Gong, Pengjia Zhu

doi:10.1007/s11042-018-7020-3

Abstract

Mining the vast amount of server-side logging data is an essential step in boosting business intelligence and in facilitating system maintenance for multimedia or IoT-oriented services. Given the volume of the data, designers of log analysis systems must carefully balance processing speed against message classification accuracy. Conventional keyword-based log monitoring and classification is sufficiently fast but does not scale well to complex systems, especially when the target system is built by a large group of developers, each of whom may encode logging messages differently and whose messages often carry misleading labels. Conversely, many of the more sophisticated approaches are so time-consuming that delayed processing jobs begin to accumulate, and they can hardly support timely decision making. We also suggest that the design of a large-scale online log analysis system should require as little prior knowledge as possible, so that unsupervised or semi-supervised solutions are preferred. In this paper, we propose a two-stage machine-learning-based method in which system logs are regarded as the output of a quasi-natural language, pre-filtered by a perplexity score threshold, and then passed to a fine-grained classification procedure. Empirical studies on our web services show that our method has a clear advantage in both processing speed and classification accuracy.
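To make the two-stage idea concrete, the following is a minimal illustrative sketch and not the implementation evaluated in the paper. It assumes a whitespace tokenizer, an add-one smoothed bigram language model, and that messages whose perplexity exceeds the threshold are the ones forwarded to the second stage; none of these choices is stated in the abstract. The names `BigramModel`, `prefilter`, and `toy_classifier` are hypothetical, and `toy_classifier` is only a stand-in for the semi-supervised classifier.

```python
import math
from collections import Counter


class BigramModel:
    """Add-one smoothed bigram model over whitespace tokens (an assumption)."""

    def __init__(self, messages):
        self.context = Counter()   # how often each token appears as a left context
        self.bigrams = Counter()   # how often each (previous, current) pair appears
        seen = set()
        for msg in messages:
            tokens = ["<s>"] + msg.lower().split() + ["</s>"]
            seen.update(tokens)
            self.context.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(seen) + 1  # +1 reserves mass for unseen tokens

    def perplexity(self, msg):
        """Per-token perplexity of one log message under the model."""
        tokens = ["<s>"] + msg.lower().split() + ["</s>"]
        log_prob = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            p = (self.bigrams[(prev, cur)] + 1) / (self.context[prev] + self.vocab_size)
            log_prob += math.log(p)
        return math.exp(-log_prob / (len(tokens) - 1))


def prefilter(model, messages, threshold):
    """Stage 1: keep only messages the language model finds surprising.

    Assumption: high perplexity marks a message as worth closer inspection.
    """
    return [m for m in messages if model.perplexity(m) > threshold]


def toy_classifier(msg):
    """Stage 2 stand-in; the paper uses a semi-supervised classifier instead."""
    return "error" if "fail" in msg.lower() else "other"


if __name__ == "__main__":
    routine = ["GET /index 200 ok", "GET /home 200 ok", "POST /login 200 ok"]
    model = BigramModel(routine)
    incoming = ["GET /index 200 ok", "disk write failed on node7"]
    for msg in prefilter(model, incoming, threshold=8.0):
        print(msg, "->", toy_classifier(msg))
```

The point of the design is that the cheap first stage discards routine traffic, so the slower classifier only sees the small fraction of messages that pass the threshold. The threshold value of 8.0 is arbitrary and would need tuning on real logs.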

Full Text