Assessing the impact of bag‐of‐words versus word‐to‐vector embedding methods and dimension reduction on anomaly detection from log files

Ziyu Qiu, Bradley Niblett, Nur Zincir‐Heywood, Zhilei Zhou, Andrew Johnston, Jeffrey Schwartzentruber, Malcolm I Heywood

doi:10.1002/nem.2251

Abstract

In cyber security, log files represent a rich source of information regarding the state of a computer service/system. Automating the process of summarizing log file content represents an important aid for decision‐making, especially given the 24/7 nature of network/service operations. We perform benchmarking over eight distinct log files in order to assess the impact of the following: (1) different embedding methods for developing semantic descriptions of the original log files, (2) applying dimension reduction to the high‐dimensional semantic space, and (3) using different unsupervised learning algorithms to provide a visual summary of the service state. Benchmarking demonstrates that (1) word‐to‐vector embeddings identified by bidirectional encoder representations from transformers (BERT) without “fine‐tuning” are sufficient to match the performance of bag‐of‐words embeddings provided by term frequency‐inverse document frequency (TF‐IDF) and (2) the self‐organizing map without dimension reduction provides the most effective anomaly detector.

Full Text