Investigating rarity in web attacks with ensemble learners

Richard Zuech, Taghi M Khoshgoftaar, John Hancock

doi:10.1186/s40537-021-00462-6

Journal: Journal of Big Data
Publication Date: May 20, 2021
Citations: 7
License type: open-access

Affiliation: Florida Atlantic University

Abstract

Class rarity is a frequent challenge in cybersecurity. Rarity occurs when the positive (attack) class only has a small number of instances for machine learning classifiers to train upon, thus making it difficult for the classifiers to discriminate and learn from the positive class. To investigate rarity, we examine three individual web attacks in big data from the CSE-CIC-IDS2018 dataset: “Brute Force-Web”, “Brute Force-XSS”, and “SQL Injection”. These three individual web attacks are also severely imbalanced, and so we evaluate whether random undersampling (RUS) treatments can improve the classification performance for these three individual web attacks. The following eight different levels of RUS ratios are evaluated: no sampling, 999:1, 99:1, 95:5, 9:1, 3:1, 65:35, and 1:1. For measuring classification performance, Area Under the Receiver Operating Characteristic Curve (AUC) metrics are obtained for the following seven different classifiers: Random Forest (RF), CatBoost (CB), LightGBM (LGB), XGBoost (XGB), Decision Tree (DT), Naive Bayes (NB), and Logistic Regression (LR) (with the first four learners being ensemble learners and for comparison, the last three being single learners). We find that applying random undersampling does improve overall classification performance with the AUC metric in a statistically significant manner. Ensemble learners achieve the top AUC scores after massive undersampling is applied, but the ensemble learners break down and have poor performance (worse than NB and DT) when no sampling is applied to our unique and harsh experimental conditions of severe class imbalance and rarity.
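To make the RUS ratios in the abstract concrete, here is a minimal sketch of random undersampling to a target negative-to-positive ratio. It is an illustration only, not the authors' implementation: the function name `random_undersample`, the `RUS_RATIOS` mapping, the reading of each ratio as negative:positive, and the encoding of labels as 0 (normal) and 1 (attack) are all assumptions made for this example.

```python
import random

# The eight RUS levels named in the abstract, read here as
# (negative parts, positive parts). None means "no sampling".
# This mapping is a hypothetical helper, not taken from the paper's code.
RUS_RATIOS = {
    "none": None,
    "999:1": (999, 1),
    "99:1": (99, 1),
    "95:5": (95, 5),
    "9:1": (9, 1),
    "3:1": (3, 1),
    "65:35": (65, 35),
    "1:1": (1, 1),
}


def random_undersample(labels, ratio, seed=0):
    """Return sorted indices kept after undersampling the negative class.

    Every positive (label 1) instance is kept. Negatives (label 0) are
    drawn without replacement so that negatives:positives is roughly
    equal to `ratio`. If there are too few negatives, all are kept.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if ratio is None:
        return sorted(pos + neg)
    neg_part, pos_part = ratio
    target_neg = min(len(neg), round(len(pos) * neg_part / pos_part))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept_neg = rng.sample(neg, target_neg)
    return sorted(pos + kept_neg)


# Toy data: 10 attack instances among 100,000 normal instances.
labels = [1] * 10 + [0] * 100_000
kept = random_undersample(labels, RUS_RATIOS["99:1"])
print(len(kept))  # 10 positives + 990 negatives = 1000
```

Undersampling like this would normally be applied to the training data only, so that the rare positive instances in the evaluation data are scored against the original class distribution.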

Highlights

Cybersecurity is an important consideration for the modern Internet era, with consumers spending over $600 billion on e-commerce sales during 2019 in the United States [1].
This section is divided into three subsections, one for each of the three datasets we evaluated for our individual web attacks from CSE-CIC-IDS2018: Brute Force, Cross-site scripting (XSS), and SQL Injection, as listed in Table 2.
A total of eight different sampling ratios are evaluated. These seven classifiers are evaluated in the following tables for our various random undersampling (RUS) ratios: Random Forest (RF), Logistic Regression (LR), XGB, CB, Naive Bayes (NB), Decision Tree (DT), and LGB.

Summary

Introduction

Cybersecurity is an important consideration for the modern Internet era, with consumers spending over $600 billion on e-commerce sales during 2019 in the United States [1].

Results

This section is divided into three subsections, one for each of the three datasets we evaluated for our individual web attacks from CSE-CIC-IDS2018: Brute Force, XSS, and SQL Injection, as listed in Table 2.

Conclusion