A Markov Random Field Based Approach for Analyzing Supercomputer System Logs

Thomas Hacker, Rui Pais, Chunming Rong

doi:10.1109/tcc.2017.2678473

Open Access

Abstract

High-performance computing systems composed of hundreds or thousands of computational nodes can generate a high volume of system log entries at a high data velocity. Analyzing these logs soon after they are generated is a significant challenge because of the complexity of log messages, the speed at which they are produced, and the lack of a method to quickly map or categorize messages into meaningful sets. As a result, it is not possible to comprehensively glean timely information from the logs about the overall system or the health of individual nodes. In this paper, we address this problem by developing a novel approach to system log analysis based on a Markov random field (MRF) that can quickly categorize system log messages into multiple categories based on representative training examples provided by a user. We present a theoretical model of our approach, followed by an extensive evaluation of the accuracy and performance of our implementation of the model. We found that our MRF-based approach can quickly categorize system log messages with a high degree of accuracy.
