Abstract

In cyber security, log files are a rich source of information about the state of a computer service or system. Automating the summarization of log file content is an important aid for decision‐making, especially given the 24/7 nature of network and service operations. We perform benchmarking over eight distinct log files to assess the impact of (1) different embedding methods for developing semantic descriptions of the original log files, (2) applying dimension reduction to the high‐dimensional semantic space, and (3) using different unsupervised learning algorithms to provide a visual summary of the service state. The benchmarking demonstrates that (1) word‐to‐vector embeddings identified by Bidirectional Encoder Representations from Transformers (BERT) without “fine‐tuning” are sufficient to match the performance of Bag‐of‐Words embeddings provided by term frequency‐inverse document frequency (TF‐IDF) and (2) the self‐organizing map without dimension reduction provides the most effective anomaly detector.
