Three-stage data generation algorithm for multiclass network intrusion detection with highly imbalanced dataset

Kwok Tai Chui, Brij B Gupta, Priyanka Chaurasia, Varsha Arya, Ammar Almomani, Wadee Alhalabi

doi:10.1016/j.ijin.2023.08.001

Abstract

The Internet plays a crucial role in our daily routines, and ensuring cybersecurity for Internet users provides a safe online environment. Automatic network intrusion detection (NID) using machine learning algorithms has recently received increased attention. Because the datasets are highly imbalanced across different types of attacks, an NID model is prone to bias towards the classes with more training samples. The challenge in generating additional training data for minority classes is that too little data is generated. This study addresses the challenge by proposing a three-stage data generation algorithm that extends the data generation ability using the synthetic minority over-sampling technique (SMOTE), a generative adversarial network (GAN), and a variational autoencoder (VAE). A convolutional neural network is employed to extract representative features from the data, which are fed into a support vector machine with a customised kernel function. An ablation study evaluated the effectiveness of the three-stage data generation, feature extraction, and customised kernel, followed by a performance comparison between our study and existing studies. The findings revealed that the proposed NID model achieved an accuracy of 91.9%–96.2% on the four benchmark datasets. In addition, it outperformed existing methods such as GAN-based deep neural networks, conditional Wasserstein GAN-based stacked autoencoder, synthetic minority over-sampling technique-based random forest, and variational autoencoder-based deep neural network, by 1.51%–28.4%.
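To make the oversampling idea concrete, the following is a minimal sketch of SMOTE-style interpolation, the technique named for the first generation stage. It is not the authors' implementation: the function name `smote_like_oversample`, its parameters `n_new`, `k`, and `seed`, and the toy data are all hypothetical, and the GAN and variational autoencoder stages are not shown. Each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank all other minority samples by squared Euclidean distance to base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j]))
        )
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy minority class (assumed data): four points on the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_oversample(minority, n_new=6)
```

Because every synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies. That is one plausible reason to combine interpolation with generative models such as a GAN or a variational autoencoder, which can produce samples beyond those line segments.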

Full Text