Feature Selection and Ensemble-Based Intrusion Detection System: An Efficient and Comprehensive Approach

Ebrima Jaw, Xueming Wang

doi:10.3390/sym13101764

Abstract

The emergence of ground-breaking technologies such as artificial intelligence, cloud computing, and big data powered by the Internet, together with their highly valued real-world applications consisting of symmetric and asymmetric data distributions, has changed our lives in many positive ways. However, it has also brought catastrophic cyberattacks that escalate daily. This raises the need for researchers to harness the strengths of machine learning to design and implement intrusion detection systems (IDSs) that help mitigate these cyber threats. Nevertheless, building trustworthy and effective IDSs remains a challenge because of low accuracy caused by vast, irrelevant, and redundant features; the inability of individual machine learning classifiers to detect all types of novel attacks; the costly and error-prone use of labeled training datasets; significant false alarm rates (FAR); and excessive model building and testing time. Therefore, this paper proposes a hybrid feature selection (HFS) method with an ensemble classifier, which efficiently selects relevant features and provides consistent attack classification. First, we harness the strengths of CfsSubsetEval, genetic search, and a rule-based engine to select subsets of features with high correlation, which considerably reduces the model complexity and enhances the generalization of the learning algorithms, both of which are symmetry learning attributes. Moreover, using a voting method and the average of probabilities, we present an ensemble classifier that combines K-means, One-Class SVM, DBSCAN, and Expectation-Maximization, abbreviated KODE, as an enhanced classifier that consistently classifies the asymmetric probability distributions between malicious and normal instances. Evaluated with 10-fold cross-validation on the CIC-IDS2017, NSL-KDD, and UNSW-NB15 datasets across various metrics, HFS-KODE achieves remarkable results.
For example, it outperformed all the selected individual classification methods, cutting-edge feature selection techniques, and some current IDS techniques, with an accuracy of 99.99%, 99.73%, and 99.997% and a detection rate of 99.75%, 96.64%, and 99.93% for CIC-IDS2017, NSL-KDD, and UNSW-NB15, respectively, based on only 11, 8, and 13 selected relevant features from these datasets. Finally, considering the drastically reduced FAR and time, coupled with no need for labeled datasets, HFS-KODE shows a remarkable performance compared to many current approaches.
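To illustrate the average-of-probabilities combination mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes each ensemble member has already produced a per-class probability vector for one network record. How the unsupervised members of KODE derive such probabilities is not described here, so the input values, the function names `average_of_probabilities` and `predict_label`, and the label order are all hypothetical.

```python
from typing import List, Sequence


def average_of_probabilities(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-class probability vectors produced by the ensemble members."""
    if not member_probs:
        raise ValueError("at least one ensemble member is required")
    n_classes = len(member_probs[0])
    if any(len(p) != n_classes for p in member_probs):
        raise ValueError("all members must report the same number of classes")
    n_members = len(member_probs)
    # Mean of each class column across the members.
    return [sum(p[c] for p in member_probs) / n_members for c in range(n_classes)]


def predict_label(member_probs: Sequence[Sequence[float]],
                  labels: Sequence[str] = ("normal", "attack")) -> str:
    """Return the label whose averaged probability is highest."""
    avg = average_of_probabilities(member_probs)
    best = max(range(len(avg)), key=lambda c: avg[c])
    return labels[best]


# Hypothetical outputs of four members for one record: [P(normal), P(attack)].
example = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7], [0.2, 0.8]]
decision = predict_label(example)
```

In this made-up example one member strongly favors "normal", yet the averaged vector leans toward "attack", which shows how averaging lets the ensemble override a single confident but outvoted member.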

Highlights

This section presents a detailed systematic analysis and performance validation of the proposed system, measuring how efficiently it can select the relevant features among thousands of records and use these few selected features to accurately classify network traffic as either benign or malicious. It provides a thorough performance evaluation of the individual classifiers and KODE on each of the three raw datasets, the selected features, various combination methods, and state-of-the-art approaches, based on numerous metrics such as false alarm rate (FAR), accuracy (ACC), detection rate (DR), precision, F1-measure, model building time (MBT), and model testing time (MTT).
To mitigate the challenges of data sampling, the k-fold cross-validation folds are set to explicit training and testing percentages, so that the testing portion is not seen during the training stage and can be used to test the model’s quality, generalizability, and reliability.
Existing studies have shown that the curse of dimensionality from unbalanced network traffic, low detection rates, high false alarm rates, low accuracy, and the difficulty of obtaining sufficient labeled datasets remain challenges.
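The metrics named in the highlights above can be computed from a binary confusion matrix. The sketch below uses the standard textbook definitions, with "attack" as the positive class. These definitions are assumed rather than quoted from the paper, and the class `ConfusionCounts`, the function `ids_metrics`, and the example counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # attacks correctly flagged as attacks
    fn: int  # attacks missed (classified as normal)
    fp: int  # normal traffic wrongly flagged as attacks
    tn: int  # normal traffic correctly classified as normal


def _ratio(numerator: float, denominator: float) -> float:
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def ids_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Compute ACC, DR, FAR, precision, and F1 from confusion-matrix counts."""
    total = c.tp + c.fn + c.fp + c.tn
    precision = _ratio(c.tp, c.tp + c.fp)
    detection_rate = _ratio(c.tp, c.tp + c.fn)  # also called recall
    return {
        "ACC": _ratio(c.tp + c.tn, total),
        "DR": detection_rate,
        "FAR": _ratio(c.fp, c.fp + c.tn),  # false positive rate
        "Precision": precision,
        "F1": _ratio(2 * precision * detection_rate, precision + detection_rate),
    }


# Made-up counts for illustration only; not results from the paper.
metrics = ids_metrics(ConfusionCounts(tp=90, fn=10, fp=5, tn=895))
```

With unbalanced traffic, accuracy alone can look high even when many attacks are missed, which is why DR and FAR are reported alongside it.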

Summary

Introduction

According to McKinsey in reference [2], the competitive edge in the global market is currently driven by harnessing efficient and productive big data and cutting-edge technologies. As a result, these infrastructures attract attention from governments and business industries, as well as illegal attempts to access these sensitive and valuable data. These valuable data are generally partial in many big data and real-world applications and are categorized into asymmetric and symmetric data distributions. Establishing an efficient and effective means of filtering these valuable patterns is significant [3].

Objectives

Methods

Results

Conclusion