Abstract

An effective anomaly-based intelligent IDS (AN-Intel-IDS) must detect both known and unknown attacks. Hence, there is a need to train AN-Intel-IDS using dynamically generated, real-time data in an adversarial setting. Unfortunately, the public datasets available for training AN-Intel-IDS are ineluctably static, unrealistic, and prone to obsolescence. Furthermore, the need to protect private data and conceal sensitive data features has limited data sharing, thus encouraging the use of synthetic data for training predictive and intrusion detection models. However, synthetic data can be unrealistic and potentially biased. Real-time data, on the other hand, are realistic and current, but they are inherently imbalanced because anomalous and non-anomalous examples occur unevenly: normal examples are generally far more frequent than attack examples, leading to a skewed class distribution. Although imbalanced data predominate in intrusion detection applications, they can lead to inaccurate predictions and degraded performance. Moreover, the lack of real-time data produces potentially biased models that are less effective at predicting unknown attacks. Therefore, training AN-Intel-IDS with imbalanced learning and adversarial learning is instrumental to its efficacy and performance. This paper investigates imbalanced learning and adversarial learning for training AN-Intel-IDS through a qualitative study. It surveys and synthesizes generative-based data augmentation techniques for addressing uneven data distribution and generative-based adversarial techniques for generating synthetic yet realistic data in an adversarial setting, using rapid review, structured reporting, and subgroup analysis.
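To make the generative-based data augmentation discussed above concrete, the following is a minimal, hedged sketch of a GAN trained to mimic minority (attack) feature vectors so that its synthetic output can rebalance a training set. It is not the implementation of any surveyed work; the PyTorch architecture, feature dimension, and training settings are illustrative assumptions.

```python
# A minimal, hedged sketch of GAN-based minority-class augmentation for tabular
# intrusion-detection features (assumes PyTorch; dimensions are illustrative).
import torch
import torch.nn as nn

FEATURES = 20   # assumed number of flow features per example
LATENT = 16     # assumed dimension of the generator's noise input


class Generator(nn.Module):
    """Maps random noise to a synthetic minority-class feature vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(),
                                 nn.Linear(64, FEATURES))

    def forward(self, z):
        return self.net(z)


class Discriminator(nn.Module):
    """Scores how 'real' a feature vector looks (logit output)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEATURES, 64), nn.LeakyReLU(0.2),
                                 nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x)


def train_gan(minority_x, epochs=200, batch=64):
    """Train the generator on a float tensor of minority (attack) examples."""
    G, D = Generator(), Discriminator()
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        real = minority_x[torch.randint(0, minority_x.size(0), (batch,))]
        fake = G(torch.randn(batch, LATENT))
        # Discriminator step: label real samples 1 and generated samples 0.
        d_loss = (bce(D(real), torch.ones(batch, 1)) +
                  bce(D(fake.detach()), torch.zeros(batch, 1)))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()
        # Generator step: push D to label generated samples as real.
        g_loss = bce(D(fake), torch.ones(batch, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
    return G

# Hypothetical usage: synth = train_gan(attack_tensor)(torch.randn(500, LATENT))
```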

Highlights

  • In a binary classification problem, such as anomaly-based detection, where the dataset contains two sets of examples, it is common to encounter class imbalance

  • The authors used balanced data generated by a generative adversarial network (GAN), which addressed overfitting and overlapping by specifying the desired resampling rate, to train an anomaly-based detection model based on the random forest (RF) method, increasing the weight of the minority attack class in the Intrusion Detection Evaluation Dataset (CICIDS); a hedged sketch of this workflow follows these highlights

  • Our initial focus was to categorize the surveyed data-driven learning (DDL) methods and techniques into data augmentation and data generation, based on the class of problem they attempt to solve, and into adversarial and non-adversarial learning, based on their learning approach
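
The highlighted workflow, GAN-balanced data plus a class-weighted random forest, can be approximated with the hedged scikit-learn sketch below. It is not the surveyed authors' code: the arrays stand in for real CICIDS features and GAN output, and the class-weight ratio is purely illustrative.

```python
# A hedged sketch: train a random forest on a CICIDS-like feature matrix after
# appending GAN-generated attack samples and up-weighting the minority (attack)
# class. X_real, y_real, and synthetic_attacks are hypothetical placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X_real = rng.normal(size=(5000, 20))             # stand-in for real flow features
y_real = (rng.random(5000) < 0.05).astype(int)   # ~5% attacks: imbalanced labels
synthetic_attacks = rng.normal(size=(1000, 20))  # stand-in for GAN-generated attacks

# Merge real data with synthetic attack samples (label 1 = attack).
X = np.vstack([X_real, synthetic_attacks])
y = np.concatenate([y_real, np.ones(len(synthetic_attacks), dtype=int)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight raises the penalty for misclassifying the minority attack class;
# the 1:5 ratio here is an illustrative assumption, not the paper's setting.
rf = RandomForestClassifier(n_estimators=200,
                            class_weight={0: 1, 1: 5},
                            random_state=0)
rf.fit(X_tr, y_tr)
print(classification_report(y_te, rf.predict(X_te)))
```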


Summary

Introduction

In a binary classification problem, such as anomaly-based detection, where the dataset contains two sets of examples (normal and anomalous), it is common to encounter class imbalance. Class imbalance generally occurs when the normal set contains significantly more examples, or samples, than the anomalous set, dividing the dataset into majority and minority class samples. Data imbalance, or uneven class distribution, can cause an AN-Intel-IDS model to over-classify the normal class because of its higher probability in the dataset relative to the anomalous one. Resampling techniques, which are typically applied before learning, adjust the class distribution to mitigate the data imbalance problem. Oversampling and undersampling techniques focus on balancing the distribution of the majority and minority classes in the dataset. While oversampling and undersampling reduce data imbalance using the existing examples, SMOTE, an intelligent data resampling technique, reduces the degree of imbalance by synthetically creating new minority class examples [13]. In SMOTE, overfitting is less of a concern than class overlapping, which results from interpolating between closely adjacent instances of the minority class.
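
As a concrete illustration of the resampling techniques described above, the hedged sketch below applies random undersampling, random oversampling, and SMOTE with the imbalanced-learn library. The dataset is a synthetic stand-in generated by scikit-learn, not an IDS benchmark, and the class ratio and feature count are assumptions.

```python
# A minimal sketch of the resampling techniques discussed above, using the
# imbalanced-learn (imblearn) library on a synthetic two-class dataset.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# 95% normal (majority) vs 5% anomalous (minority) examples.
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print("original:", Counter(y))

# Undersampling: discard majority examples until the classes are balanced.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_u))

# Oversampling: duplicate minority examples (risk of overfitting).
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled:", Counter(y_o))

# SMOTE: interpolate between neighbouring minority instances to create new,
# synthetic minority examples (risk of class overlapping near boundaries).
X_s, y_s = SMOTE(random_state=0).fit_resample(X, y)
print("SMOTE:", Counter(y_s))
```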

