Impact of the Structure of Data Pre-processing Pipelines on the Performance of Classifiers When Applied to Imbalanced Network Intrusion Detection System Dataset

L. Guan, E. A. Edirisinghe, I. Al-Mandhari

doi:10.1007/978-3-030-29516-5_45

Abstract

The application of machine learning techniques to network intrusion detection has become popular over the last decade. Because some attacks are rare and others frequent in practice, the datasets available for training machine learning algorithms, i.e. classifiers, are imbalanced. For example, the most widely used network Intrusion Detection System (IDS) dataset, the KDD Cup 99 dataset, is known to be imbalanced, meaning that the number of instances (i.e. occurrences of attacks) differs considerably between the dataset's classes. The resulting data complexity (e.g. irrelevant features, class imbalance) influences how effective a learning task is when this dataset is used to train a machine learning classifier. A typical machine learning based IDS uses at least two pre-processing stages, data resampling and feature selection, within its data pre-processing pipeline. The separate impact of data resampling and of feature selection on classifier accuracy has been investigated in detail in the literature. However, the question of whether feature selection should be performed before or after resampling when tackling imbalanced datasets such as the KDD Cup 99 dataset has not been investigated. Nor has the impact of this ordering within the data pre-processing pipeline on the performance of different classifiers been studied. This paper centres on the combined utilisation of resampling techniques and feature selection approaches within the data pre-processing pipeline of an IDS, and explores which order of application achieves superior classification results for a given classifier. Seven feature selection methods are studied alongside a widely used resampling technique. The impact on three widely used classification algorithms is investigated: Naive Bayes, Random Forest and Stacking. The performance of the classifiers is examined in detail to determine which should come first, resampling or feature selection.
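
The two pipeline orderings compared in the abstract can be illustrated with a minimal sketch. The abstract does not name the resampling technique or the seven feature selection methods, so the code below substitutes simple stand-ins: random oversampling of minority classes and a Fisher-style between-class to within-class variance score. All function names, the toy data and the choice of k are hypothetical and serve only to show where the two orderings differ structurally.

```python
import random
from collections import Counter
from statistics import mean, pvariance


def oversample(X, y, seed=0):
    """Stand-in resampler: duplicate random minority rows until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [row for row, lab in zip(X, y) if lab == label]
        X_out.extend(rng.choices(rows, k=target - n))
        y_out.extend([label] * (target - n))
    return X_out, y_out


def select_features(X, y, k):
    """Stand-in filter: keep the k features with the highest between/within class variance ratio."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall = mean(column)
        between = within = 0.0
        for label in sorted(set(y)):
            values = [v for v, lab in zip(column, y) if lab == label]
            between += len(values) * (mean(values) - overall) ** 2
            within += len(values) * pvariance(values)
        scores.append(between / (within + 1e-12))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def project(X, cols):
    """Keep only the selected feature columns."""
    return [[row[c] for c in cols] for row in X]


def resample_then_select(X, y, k):
    # Ordering A: feature scores are computed on the balanced data.
    X_bal, y_bal = oversample(X, y)
    cols = select_features(X_bal, y_bal, k)
    return project(X_bal, cols), y_bal, cols


def select_then_resample(X, y, k):
    # Ordering B: feature scores are computed on the original imbalanced data.
    cols = select_features(X, y, k)
    X_bal, y_bal = oversample(project(X, cols), y)
    return X_bal, y_bal, cols


# Hypothetical toy data: 95 "normal" rows, 5 "attack" rows, 3 features,
# of which only feature 0 is shifted for the minority class.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(95)]
X += [[rng.gauss(3, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(5)]
y = ["normal"] * 95 + ["attack"] * 5

X_a, y_a, cols_a = resample_then_select(X, y, k=2)
X_b, y_b, cols_b = select_then_resample(X, y, k=2)
print("resample -> select:", cols_a, Counter(y_a))
print("select -> resample:", cols_b, Counter(y_b))
```

Both orderings yield a balanced training set with the same number of features. They differ only in which class distribution the feature scores are computed on, so the selected columns may differ. In either ordering, resampling should be applied to the training split only, so that duplicated rows do not leak into the evaluation data.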

Full Text