CICIDS-2017 Dataset Feature Analysis With Information Gain for Anomaly Detection

Kurniabudi Kurniabudi, Rahmat Budiarto, Deris Stiawan, Darmawijoyo Darmawijoyo, Alwi M Bamhdi, Mohd Yazid Bin Idris

doi:10.1109/access.2020.3009843

Abstract

Feature selection (FS) is an important data preprocessing task in data analytics. Data with a large number of features increases computational complexity, resource usage, and processing time. The objective of this study is to identify relevant and significant features of large-scale network traffic in order to improve the accuracy of traffic anomaly detection and to reduce its execution time. Information Gain is the most widely used feature selection technique in Intrusion Detection System (IDS) research. This study uses Information Gain to rank the features and group them according to minimum weight values, selecting the relevant and significant ones, and then applies the Random Forest (RF), Bayes Net (BN), Random Tree (RT), Naive Bayes (NB) and J48 classifier algorithms in experiments on the CICIDS-2017 dataset. The experimental results show that the number of relevant and significant features selected by Information Gain significantly affects detection accuracy and execution time. Specifically, the Random Forest algorithm achieves an accuracy of 99.86% using 22 selected features, whereas the J48 classifier algorithm achieves an accuracy of 99.87% using 52 selected features, with a longer execution time.
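The ranking-and-threshold step described in the abstract can be illustrated with a minimal sketch. It assumes discrete feature values (continuous traffic features would first need discretization) and computes Information Gain as IG = H(class) - H(class | feature). The function names, the toy columns `flag` and `noise`, and the `min_weight` value are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG = H(labels) - H(labels | feature) for a discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    total = len(labels)
    # Weighted average of the class entropy within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(columns, labels, min_weight):
    """Rank features by information gain, keep those with weight >= min_weight."""
    weights = {name: information_gain(vals, labels) for name, vals in columns.items()}
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, w) for name, w in ranked if w >= min_weight]


# Toy data (hypothetical): "flag" separates the classes perfectly, "noise" not at all.
labels = ["benign", "benign", "attack", "attack"]
columns = {"flag": [0, 0, 1, 1], "noise": [0, 1, 0, 1]}
selected = select_features(columns, labels, min_weight=0.5)
```

Raising or lowering `min_weight` changes how many features survive, which is the knob the study varies when comparing feature groups of different sizes.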

Highlights

Anomaly-based intrusion detection is one of the techniques used to recognize zero-day attacks
Methods and measures that have been proposed and shown to improve detection accuracy include Chi-Square, Information Gain, Correlation Based with Naive Bayes and Decision Table Majority Classifier [12], Support Vector Machine (SVM) [13] and Random Forest [12]
The analysis considers the following parameters: true positive rate (TPR), false positive rate (FPR), Precision, Recall, Accuracy, percentage of incorrectly classified instances, and execution time. 10-fold cross-validation is used in this stage
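As a rough illustration of the evaluation parameters listed above, the sketch below derives them from the four cells of a binary confusion matrix using their standard definitions. The function name `detection_metrics` and the example counts are hypothetical; execution time and the 10-fold cross-validation loop are not covered here.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute standard detection metrics from confusion-matrix counts.

    tp: attacks flagged as attacks      fp: benign flows flagged as attacks
    tn: benign flows passed as benign   fn: attacks passed as benign
    """
    total = tp + fp + tn + fn
    return {
        "tpr": safe_div(tp, tp + fn),          # true positive rate
        "fpr": safe_div(fp, fp + tn),          # false positive rate
        "precision": safe_div(tp, tp + fp),
        "recall": safe_div(tp, tp + fn),       # identical to TPR by definition
        "accuracy": safe_div(tp + tn, total),
        "incorrect_pct": 100.0 * safe_div(fp + fn, total),
    }


# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
```

Under 10-fold cross-validation, counts like these would typically be accumulated over the ten held-out folds before the metrics are computed.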

Summary

Introduction

Anomaly-based intrusion detection is one of the techniques used to recognize zero-day attacks. Many studies have used feature selection techniques to improve the accuracy of anomaly detection, such as the works in [7]–[11]. Methods and measures that have been proposed and shown to improve detection accuracy include Chi-Square, Information Gain, Correlation Based with Naive Bayes and Decision Table Majority Classifier [12], Support Vector Machine (SVM) [13] and Random Forest [12]. However, those methods were not tested on a large dataset with a large number of features.

Objectives

Methods

Results

Conclusion