Abstract

Computing and networking systems traditionally record their activity in log files, which have long been used for purposes such as troubleshooting, accounting, post-incident analysis of security breaches, capacity planning and anomaly detection. In earlier systems, those log files were processed manually by system administrators, or with the support of basic applications for filtering, compiling and pre-processing the logs for specific purposes. However, as the volume of these log files continues to grow (more logs per system, more systems per domain), it is becoming increasingly difficult to process them using traditional tools, especially for less straightforward purposes such as anomaly detection. At the same time, as systems grow more complex, large datasets built from logs of heterogeneous sources offer increasing potential for detecting anomalies without prior domain knowledge. Anomaly detection tools for such scenarios face two challenges: first, devising data analysis solutions that effectively detect anomalies in large data sources, possibly without prior domain knowledge; second, adopting data processing platforms able to cope with the large datasets and complex analysis algorithms such purposes require. In this paper we address those challenges by proposing an integrated, scalable framework for efficiently detecting anomalous events in large amounts of unlabeled log data. Detection is supported by clustering and classification methods that take advantage of parallel computing environments. We validate our approach using the well-known NASA Hypertext Transfer Protocol (HTTP) log datasets. Fourteen features were extracted in order to train a k-means model that separates anomalous and normal events into highly coherent clusters.
A second model, built with the XGBoost implementation of gradient tree boosting, uses the binary cluster assignments to produce a set of simple, interpretable rules. These rules provide the rationale for generalizing the approach to a massive number of unseen events in a distributed computing environment. The classified anomalous events produced by our framework can serve, for instance, as candidates for further forensic and compliance auditing analysis in security management.

Highlights

  • Hosts and network systems typically record their detailed activity in log files with specific formats, which are valuable sources for anomaly detection systems

  • In the scope of the ATENA H2020 Project [1,2], we faced this challenge while building a Forensics and Compliance Auditing (FCA) tool able to handle all the logs produced by a typical energy utility infrastructure

  • We present the algorithms and tools we adopted in our work, namely k-means (Section 2.3), decision trees (Section 2.4), gradient tree boosting on XGBoost (Section 2.5) and Dask (Section 2.6)


Introduction

Hosts and network systems typically record their detailed activity in log files with specific formats, which are valuable sources for anomaly detection systems. In the scope of the ATENA H2020 Project [1,2], we faced this challenge while building a Forensics and Compliance Auditing (FCA) tool able to handle all the logs produced by a typical energy utility infrastructure. To address such challenges, we researched novel integrated anomaly detection methods employing parallel processing capabilities to improve detection accuracy and efficiency over massive amounts of log records. These methods combine the k-means clustering algorithm [3] and the gradient tree boosting classification algorithm [4], using the former's ability to filter out normal events in order to concentrate effort on the remaining anomaly candidates.
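The parallel-processing side mentioned above can be illustrated with Dask, the framework the paper adopts (Section 2.6). The sketch below is an assumption about how a fitted model might be mapped over partitioned event data; the chunk sizes, feature count, and use of k-means here are illustrative only.

```python
# Minimal sketch: fit a model on a local sample, then apply it chunk-by-chunk
# over a partitioned Dask array, so prediction over massive numbers of
# unseen events can run on parallel workers.
import dask.array as da
import numpy as np
from sklearn.cluster import KMeans

X_big = da.random.random((10_000, 14), chunks=(2_000, 14))  # partitioned events

# Fit on one materialized partition (a simplification for the sketch).
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    X_big[:2_000].compute()
)

# map_blocks applies the prediction independently to each chunk; drop_axis=1
# reflects that predict() turns each (n, 14) block into an (n,) label vector.
preds = X_big.map_blocks(
    lambda block: model.predict(block).astype(np.int64),
    dtype=np.int64,
    drop_axis=1,
)
result = preds.compute()
```

Because each chunk is processed independently, the same graph scales from a local thread pool to a distributed cluster without code changes.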

Background and Related Work
Base Concepts
Related Work
K-Means
Decision Trees
XGBoost
Proposed Framework
Description of the Algorithm
Discussion and Evaluation
Feature Extraction and Data Exploration
Clustering
Classification
Parallelization
Findings
Discussion
Conclusions and Future Work
