Anomaly Detection in Log Files Using Selected Natural Language Processing Methods

Piotr Ryciak, Artur Janicki, Katarzyna Wasielewska

doi:10.3390/app12105089

Open Access

https://doi.org/10.3390/app12105089

Journal: Applied Sciences	Publication Date: May 18, 2022
Citations: 7	License type: CC BY 4.0

Affiliation: Warsaw University of Technology

Abstract

In this article, we address the problem of detecting anomalies in system log files. Computer systems generate huge numbers of events, which are recorded in event log files. While most entries report normal actions, an unusual entry may indicate a failure or a malware infection. A human operator can easily miss such an entry; therefore, automated anomaly detection methods are used for this purpose. In our work, we used an approach known from the natural language processing (NLP) domain that operates on so-called embeddings, that is, vector representations of words or phrases. We describe an improved version of the LogEvent2Vec algorithm, proposed in 2020. In contrast to the original version, we propose a significant shortening of the analysis window, which both increased the accuracy of anomaly detection and made further analysis of suspicious sequences much easier. We experimented with various binary classifiers, such as decision trees and multilayer perceptrons (MLPs), using the Blue Gene/L dataset. We showed that selecting an optimal classifier (in this case, an MLP) and a short log sequence gave very good results: the improved version of the algorithm yielded a best F1-score of 0.997, compared to 0.886 for the original version.
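
To make the idea of short analysis windows over event embeddings more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each log entry has already been mapped to an event identifier, that an embedding for every event is available (here a hand-written, hypothetical `TOY_EMBEDDINGS` table stands in for trained embeddings), and that a window is represented by the plain average of its event vectors. The abstract does not specify the aggregation, so averaging is an assumption, as are the function names `split_into_windows` and `window_vector`.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def split_into_windows(events: Sequence[str], window_size: int) -> List[List[str]]:
    """Split an event sequence into consecutive, non-overlapping windows.

    The last window may be shorter than window_size.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [list(events[i:i + window_size])
            for i in range(0, len(events), window_size)]


def window_vector(window: Sequence[str], embeddings: Dict[str, Vector]) -> Vector:
    """Represent a window by the average of its event embeddings.

    Averaging is an assumption made for this sketch.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for event_id in window:
        for k, value in enumerate(embeddings[event_id]):
            total[k] += value
    return [t / len(window) for t in total]


# Hypothetical 2-dimensional embeddings; real ones would be learned from logs.
TOY_EMBEDDINGS: Dict[str, Vector] = {
    "E1": [1.0, 0.0],
    "E2": [0.0, 1.0],
    "E3": [1.0, 1.0],
}

events = ["E1", "E2", "E1", "E3", "E2"]
windows = split_into_windows(events, window_size=2)
vectors = [window_vector(w, TOY_EMBEDDINGS) for w in windows]
```

Each resulting vector could then be passed, together with a normal/anomalous label for its window, to a binary classifier such as an MLP. A shorter window also narrows down which log entries an operator has to inspect when a window is flagged.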

Full Text