Abstract

Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we use the CSE-CIC-IDS2018 dataset to investigate the effect of ensemble feature selection on the performance of seven classifiers. CSE-CIC-IDS2018 is a big data dataset (about 16,000,000 instances) that is publicly available, modern, and covers a wide range of realistic attack types. Our contribution is centered around answers to three research questions. The first question is, “Does feature selection impact performance of classifiers in terms of Area Under the Receiver Operating Characteristic Curve (AUC) and F1-score?” The second question is, “Does including the Destination_Port categorical feature significantly impact performance of LightGBM and CatBoost in terms of AUC and F1-score?” The third question is, “Does the choice of classifier (Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), CatBoost, LightGBM, or XGBoost) significantly impact performance in terms of AUC and F1-score?” All three questions are answered in the affirmative, and the answers provide valuable, practical information for the development of an efficient intrusion detection model. To the best of our knowledge, we are the first to use an ensemble feature selection technique with the CSE-CIC-IDS2018 dataset.
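Since all three research questions are framed in terms of AUC and F1-score, a minimal sketch of how these two metrics are computed may help. The functions, labels, and scores below are illustrative assumptions on toy data, not the study's code or results; label 1 is assumed to mark attack traffic and label 0 normal traffic.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive (attack) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive scores above a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a toy example but not for millions of instances.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four normal instances and two attack instances.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]          # classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # threshold at 0.5

toy_f1 = f1_score(y_true, y_pred)
toy_auc = auc(y_true, scores)
```

AUC is computed from the raw scores and is therefore threshold-independent, while F1 depends on the chosen decision threshold (0.5 here, an arbitrary choice). For a dataset of this size, a library implementation such as scikit-learn's `roc_auc_score` and `f1_score` would be the practical option.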

Highlights

  • CSE-CIC-IDS2018 [1], referred to as the 2018 dataset throughout this text, is an intrusion detection dataset with normal and anomalous instances of network traffic

  • Our contribution is defined by our responses to three research questions. The first question is, “Does feature selection impact performance of classifiers in terms of Area Under the Receiver Operating Characteristic Curve (AUC) and F1-score?” The second question is, “Does including the Destination_Port categorical feature significantly impact performance of LightGBM and CatBoost in terms of AUC and F1-score?” The third question is, “Does the choice of classifier (Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), CatBoost, LightGBM, or XGBoost) significantly impact performance in terms of AUC and F1-score?” The answers to these research questions provide valuable and practical information for the development of an efficient intrusion detection model

  • The first results we report are the mean AUC and F1-scores for CatBoost, LightGBM, and CategoricalNB with their respective datasets



Introduction

CSE-CIC-IDS2018 [1], referred to as the 2018 dataset throughout this text, is an intrusion detection dataset with normal and anomalous instances of network traffic. Machine learning models efficiently trained on CSE-CIC-IDS2018 can detect network traffic capable of compromising an information system. This dataset is the most recent iteration of ISCXIDS2012 [2], a scalable project designed to produce modern, realistic datasets. CSE-CIC-IDS2018 data originated from an extensive network of victim and attack machines [3], yielding an aggregate of 16,233,002 instances. Six classes of attack traffic (percentage distribution shown in Table 1) are represented by about 17% of these instances. The dataset is distributed over ten CSV files that are

