Machine Learning With Variational AutoEncoder for Imbalanced Datasets in Intrusion Detection

Ying-Dar Lin, Ren-Hung Hwang, Po-Ching Lin, Van-Linh Nguyen, Yuan-Cheng Lai, Zi-Qiang Liu

doi:10.1109/access.2022.3149295

Abstract

As a result of the explosion of security attacks and the complexity of modern networks, machine learning (ML) has recently become the favored approach for intrusion detection systems (IDS). However, the ML approach usually faces three challenges: massive attack variants, imbalanced data, and appropriate data segmentation. Improper handling of these issues significantly degrades ML performance, e.g., resulting in high false-negative rates and low recall. Despite many efforts in the literature, detecting security attacks in a complicated network environment with imperfect data collection is still an open issue. This work proposes a machine learning framework that combines a variational autoencoder with a multilayer perceptron model to deal with imbalanced datasets and detect the explosion of attack variants on the Internet. The detection engine also includes an efficient range-based sequential search algorithm to address the segmentation challenge when pre-processing data from multiple sources (network packets, system/statistic logs). Our work is the first attempt to demonstrate the effect of using an appropriate combination of ML models to boost IDS detection capability in a heterogeneous environment, where imperfect data collection is common. Experimental results on a public system log dataset (e.g., HDFS) show that our method achieves up to approximately 97% F1 score and 98% recall, a promising result compared with the same measurements for other solutions. Even better, we found that the proposed treatment of imbalanced datasets can improve the F1 score by up to 35% and the recall rate by up to 27%. The testing results also indicate that our model can detect new attack variants.
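To make the reported metrics concrete, the following minimal sketch computes recall and F1 score from confusion-matrix counts. The counts are hypothetical and chosen only to illustrate why accuracy alone is misleading on an imbalanced dataset; they are not taken from the paper's experiments.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of raised alerts that were real attacks."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of real attacks that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical imbalanced test set: 10,000 samples, only 100 are attacks.
tp, fp, fn, tn = 60, 10, 40, 9890
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"accuracy={accuracy:.3f}")
print(f"recall={recall(tp, fn):.3f}")
print(f"f1={f1_score(tp, fp, fn):.3f}")
```

With these illustrative counts, accuracy is 99.5% even though 40 of the 100 attacks are missed (recall of 60%, F1 of about 0.71), which is why the abstract reports F1 score and recall rather than accuracy.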

Highlights

Zero-day vulnerabilities have been a headache for security protection systems in susceptible networks for decades
In this work, we present a prospective machine learning (ML)-based framework to detect security attacks and their variants, even with imbalanced datasets in training
Our work aims to deal with three challenges: massive attack variants, imbalanced data issues, and effective data segmentation

Summary

Introduction

Zero-day vulnerabilities have been a headache for security protection systems in susceptible networks for decades. If an attack exploits a zero-day vulnerability, an IDS that relies on known attack patterns will likely fail to detect it. Another common solution is anomaly detection with various baselines [4]: the detection engine flags abnormal behavior by checking whether the traffic pattern deviates far from a defined “normal” profile. This approach comes at the cost of a high false-positive rate. Our IDS supports data from both system logs and network packets during training. A fixed length is used when structuring system logs.
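The baseline-style anomaly detection described above, and its false-positive trade-off, can be sketched as follows. This is a generic illustration, not the framework proposed in this work: the z-score profile, the two traffic features, the synthetic data, and the thresholds are all assumptions made for the example.

```python
import numpy as np


def fit_normal_profile(benign: np.ndarray):
    """Per-feature mean and standard deviation of benign traffic."""
    mean = benign.mean(axis=0)
    std = benign.std(axis=0) + 1e-9  # guard against division by zero
    return mean, std


def anomaly_scores(x: np.ndarray, mean: np.ndarray, std: np.ndarray) -> np.ndarray:
    """Largest absolute z-score across features, one score per sample."""
    return np.abs((x - mean) / std).max(axis=1)


rng = np.random.default_rng(0)
# Hypothetical features: packets per second and mean packet size (bytes).
benign_train = rng.normal(loc=[100.0, 500.0], scale=[10.0, 50.0], size=(1000, 2))
benign_test = rng.normal(loc=[100.0, 500.0], scale=[10.0, 50.0], size=(1000, 2))
attack_test = rng.normal(loc=[180.0, 500.0], scale=[10.0, 50.0], size=(50, 2))

mean, std = fit_normal_profile(benign_train)


def rates(threshold: float):
    """False-positive rate on benign data and detection rate on attacks."""
    fpr = float((anomaly_scores(benign_test, mean, std) > threshold).mean())
    tpr = float((anomaly_scores(attack_test, mean, std) > threshold).mean())
    return fpr, tpr


for threshold in (2.0, 3.0, 4.0):
    fpr, tpr = rates(threshold)
    print(f"threshold={threshold}: false-positive rate={fpr:.3f}, detection rate={tpr:.3f}")
```

A tighter threshold flags more deviations from the profile but also more benign samples, which is the false-positive cost mentioned above. In this toy setup the attack is far from the profile, so detection stays high at every threshold; attacks that resemble normal traffic would not be separated so cleanly.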

Methods

Results

Conclusion