Abstract

Cyber-attacks are among the most alarming threats of the present era. Firewalls, Intrusion Detection Systems (IDSs), and related techniques are widely deployed to prevent them. An IDS acts as a watchdog that monitors network traffic and raises an alert when unauthorized activity is detected. However, most IDSs are designed on traditional datasets that do not address the data imbalance problem, so their performance against minority-class samples is poor. To overcome this problem and detect cyber-attacks with higher precision, this paper proposes a novel IDS that combines a Wasserstein Conditional Generative Adversarial Network with Gradient Penalty (WCGAN-GP) and a Genetic Algorithm (GA). The two components are introduced for synthetic data generation, feature optimization, and attack detection: the WCGAN-GP generates synthetic samples that follow the underlying distribution of the real data, while a novel fitness function drives the convergence of the GA and produces an optimal feature vector for the classification problem. An extensive experimental study on the NSL-KDD and UNSW-NB15 datasets, in combination with the generated samples and the reduced feature set from the proposed model, evaluates the performance of different machine learning models; XGBoost performs best among them. To justify the use of a genetic algorithm for feature optimization, the proposed model is compared with traditional feature selection and dimensionality reduction approaches such as PCA, Autoencoder, and t-SNE, and the proposed GA-based feature selection outperforms them.
The reason is that the GA-based feature selection technique offers several advantages over PCA- and Autoencoder-based dimensionality reduction: it can handle non-linearity between features, provides greater flexibility in feature selection, supports interpretability and human insight, scales to high-dimensional data, and is robust to noise and outliers. It also offers an advantage in computational efficiency over t-SNE, a nonlinear dimensionality reduction technique. Furthermore, the model is compared with other state-of-the-art approaches that address the data imbalance problem. The proposed model with the XGBoost classifier gives better results than other existing approaches in terms of accuracy (95.54%), precision (92.61%), recall (95.54%), F1-score (93.41%), and false alarm rate (4.30%) on the NSL-KDD dataset, and accuracy (89.58%), precision (89.46%), recall (89.58%), F1-score (88.89%), and false alarm rate (1.18%) on the UNSW-NB15 dataset.
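As an illustration of the kind of GA-based feature selection described above, the sketch below evolves binary feature masks with tournament selection, single-point crossover, and bit-flip mutation. Note that `toy_fitness` is a deliberately simple stand-in, not the paper's novel fitness function; in practice the fitness would wrap a classifier such as XGBoost evaluated on the masked feature set.

```python
import random

def ga_feature_select(n_features, fitness, pop_size=20, generations=30,
                      crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Evolve binary feature masks; return the best mask found."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]

    def tournament():
        # Pick two random masks and keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(pop, key=fitness)
    for _ in range(generations):
        nxt = [best[:]]  # elitism: carry the best mask forward unchanged
        while len(nxt) < pop_size:
            p1, p2 = tournament()[:], tournament()[:]
            if rng.random() < crossover_rate:  # single-point crossover
                cut = rng.randrange(1, n_features)
                p1 = p1[:cut] + p2[cut:]
            # Bit-flip mutation on each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in p1]
            nxt.append(child)
        pop = nxt
        best = max(pop + [best], key=fitness)
    return best

# Toy fitness: reward masks that select exactly the first half of the
# features, with a small per-feature penalty favoring compact masks
# (a stand-in for classifier accuracy minus a feature-count cost).
def toy_fitness(mask):
    half = len(mask) // 2
    target = [1] * half + [0] * (len(mask) - half)
    matches = sum(m == t for m, t in zip(mask, target))
    return matches - 0.01 * sum(mask)

best_mask = ga_feature_select(10, toy_fitness)
```

The per-feature penalty term mirrors a common design choice in GA-based feature selection: without it, the search has no pressure toward smaller feature subsets.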
