Detecting Cybersecurity Attacks Using Different Network Features with LightGBM and XGBoost Learners

Joffrey L. Leevy, Richard Zuech, Taghi M. Khoshgoftaar, John Hancock

doi:10.1109/cogmi50398.2020.00032

Abstract

CSE-CIC-IDS2018 is an intrusion detection dataset containing roughly 16,000,000 normal and anomalous instances, with about 17% of these instances representing attack traffic. Our big data study has two parts: ensemble feature selection and model comparison. In the first part, we select features from the dataset as input to the two classifiers that we employ in the second part. In the second part, we evaluate the performance of the classifiers with Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) and F1-score. The outcome of our experiments enables us to answer three research questions. The first question is, “Does feature selection impact performance of classifiers in terms of AUC and F1-score?” The second question is, “Does including the Destination_Port categorical feature significantly impact performance of LightGBM in terms of AUC and F1-score?” The third question is, “Does the choice of classifier, LightGBM or XGBoost, significantly impact performance in terms of AUC and F1-score?” For CSE-CIC-IDS2018, we conclude that feature selection and classifier choice both impact performance, and that Destination_Port is a significant feature for LightGBM. In our case study, we present the application of an ensemble feature selection technique and an analysis of its impact. To the best of our knowledge, we are the first to apply this technique to the CSE-CIC-IDS2018 dataset.
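As a reminder of what the two evaluation metrics named in the abstract measure, the following is a minimal sketch in plain Python. It is not the authors' evaluation code: the labels and scores are hypothetical toy values, the 0.5 decision threshold is an assumption, and the pairwise AUC computation is quadratic in the number of instances, so it is suitable only for illustration and not for a dataset of this size.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 = 2*TP / (2*TP + FP + FN), with the attack class labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive instance is scored
    higher than a random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = attack, 0 = normal traffic.
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]  # classifier scores for class 1

# Assumed threshold of 0.5 to turn scores into hard predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"F1-score: {f1:.3f}, AUC: {auc:.3f}")
```

F1-score depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two metrics are reported together on imbalanced data.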

Full Text