Resampling imbalanced data for network intrusion detection datasets

Sikha Bagui, Kunqi Li

doi:10.1186/s40537-020-00390-x

Open Access

https://doi.org/10.1186/s40537-020-00390-x

Journal: Journal of Big Data	Publication Date: Jan 6, 2021
Citations: 120	License type: open-access

Affiliation: University of West Florida

Abstract

Machine learning plays an increasingly significant role in building Network Intrusion Detection Systems. However, machine learning models trained on imbalanced cybersecurity data cannot effectively recognize the minority classes, which are the attacks. One way to address this issue is resampling, which adjusts the ratio between the classes to make the data more balanced. This research examines the influence of resampling on the performance of Artificial Neural Network multi-class classifiers. Five resampling methods were applied to the benchmark cybersecurity datasets KDD99, UNSW-NB15, UNSW-NB17 and UNSW-NB18: random undersampling; random oversampling; random undersampling combined with random oversampling; random undersampling combined with the Synthetic Minority Oversampling Technique (SMOTE); and random undersampling combined with the Adaptive Synthetic Sampling Method (ADASYN). Macro precision, macro recall, and macro F1-score were used to evaluate the results. Four patterns emerged: first, oversampling increases training time and undersampling decreases it; second, if the data is extremely imbalanced, both oversampling and undersampling increase recall significantly; third, if the data is not extremely imbalanced, resampling has little impact; fourth, with resampling, mostly oversampling, more of the minority data (attacks) were detected.
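
The two simplest methods named in the abstract, random undersampling and random oversampling, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_resample`, the fixed `target_per_class` size, the toy class counts, and the class labels are all assumptions made for illustration, and the ratios actually used in the study are not stated in this excerpt.

```python
import random
from collections import Counter, defaultdict


def random_resample(rows, labels, target_per_class, seed=0):
    """Resample every class to exactly `target_per_class` rows.

    Classes larger than the target are randomly undersampled (drawn
    without replacement); smaller classes are randomly oversampled
    (originals kept, extra rows drawn with replacement).
    Hypothetical helper, not taken from the paper.
    """
    rng = random.Random(seed)

    # Group the rows by class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    out_rows, out_labels = [], []
    for label in sorted(by_class):
        members = by_class[label]
        if len(members) >= target_per_class:
            # Random undersampling: keep a random subset of the class.
            chosen = rng.sample(members, target_per_class)
        else:
            # Random oversampling: duplicate randomly chosen rows.
            extra = rng.choices(members, k=target_per_class - len(members))
            chosen = members + extra
        out_rows.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_rows, out_labels


# Made-up, highly imbalanced toy data: one majority and two minority classes.
labels = ["normal"] * 1000 + ["dos"] * 40 + ["u2r"] * 5
rows = [[i, i % 7] for i in range(len(labels))]

new_rows, new_labels = random_resample(rows, labels, target_per_class=100)
print(Counter(labels))      # class counts before resampling
print(Counter(new_labels))  # class counts after resampling
```

Undersampling the majority class shrinks the training set and oversampling the minority classes enlarges it, which is consistent with the first pattern reported above about training time. SMOTE and ADASYN differ from this sketch in that they synthesize new minority rows rather than duplicating existing ones.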

Highlights

In order to detect Cyber-attacks, it is prudent that we build efficient Network Intrusion Detection Systems, and the basis for doing this is to be able to analyze network traffic flow data, termed here as Cybersecurity data, efficiently and quickly
Cybersecurity is increasingly becoming a major concern due to the increased reliance on computers and the Internet
The Artificial Neural Networks (ANN) classification was done in two modes: (i) on the Big Data framework using Spark’s Machine Learning Library; and (ii) using Scikit Learn on a local machine

Summary

Introduction

In order to detect Cyber-attacks, it is prudent that we build efficient Network Intrusion Detection Systems, and the basis for doing this is to be able to analyze network traffic flow data, termed here as Cybersecurity data, efficiently and quickly. There is an inherent problem with most network traffic flow data or Cybersecurity data: the data is highly imbalanced, that is, there is a disproportionately large amount of good or normal traffic data and, in most cases, very few attack instances. Even existing benchmark datasets suffer from this problem. Using imbalanced data for machine learning or deep learning algorithms like Artificial Neural Networks (ANN) is a major challenge. Many of these datasets require multi-class classification.
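
To see why imbalance is a challenge in the multi-class setting, it can help to compare plain accuracy with macro-averaged precision, recall and F1-score on a classifier that only ever predicts the majority class. The sketch below is an illustration under stated assumptions, not the study's evaluation code: the helper `macro_scores`, the toy class counts, and the convention of scoring an undefined precision, recall or F1 as 0.0 are all choices made here.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1).

    Each metric is computed per class and then averaged with equal
    weight per class, so rare attack classes count as much as the
    majority class. Undefined ratios (zero denominator) are scored
    as 0.0, which is an assumption of this sketch.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels: 990 normal flows and 10 attacks across two attack classes.
y_true = ["normal"] * 990 + ["dos"] * 8 + ["u2r"] * 2
# A degenerate classifier that labels every flow as normal.
y_pred = ["normal"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_precision, macro_recall, macro_f1 = macro_scores(y_true, y_pred)
print(f"accuracy={accuracy:.3f}")
print(f"macro precision={macro_precision:.3f}")
print(f"macro recall={macro_recall:.3f}")
print(f"macro F1={macro_f1:.3f}")
```

In this toy case accuracy is high even though no attack is detected, while macro recall is one third because two of the three classes have a recall of zero. Macro averaging therefore exposes the failure on minority classes that accuracy hides.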

Results

Discussion

Conclusion