Visualizing Realistic Benchmarked IDS Dataset: CIRA-CIC-DoHBrw-2020

Mohammad Hafiz Mohd Yusof, Osman Ahmed, Vladimir Shepelev, Akram A. Almohammedi

doi:10.1109/access.2022.3204690

Open Access

Abstract

Intrusion Detection System (IDS) datasets are crucial for detecting the lateral movement of cyber-attacks, because they are used to train IDS classifier models to achieve the earliest possible detection. A good, near-realistic public dataset is essential to the development of advanced IDS classifier models. However, the available public IDS datasets have long been under scrutiny. They are questioned for how well they reflect real low-footprint cyber threats, render real-time network scenarios, and reflect recent malware attacks over the newly developed DNS-over-HTTPS (DoH) protocol. They also disregard layer 3 information, and studies that use them publish contradictory classification and analysis results, which makes the findings non-reproducible and not shareable. This problem can be addressed by applying sophisticated visualization to a new realistic, real-time, low-footprint, and up-to-date benchmarked dataset. Visualization helps to detect data deformation before an optimized and highly accurate classifier model is designed. Therefore, this study aims to review a new realistic benchmarked IDS dataset and apply sophisticated techniques to visualize it. The review starts by carefully examining production network features. These are then compared with various well-established public IDS datasets. Many of them consist of static, unrealistic meta-features and disregard source and destination Internet Protocol (IP) information; the CIRA-CIC-DoHBrw-2020 dataset is the exception. The study then applies the Eigen Centrality (EC) technique from graph theory to visualize this layer 3 (L3) information. Finally, using various visualization techniques such as Principal Component Analysis (PCA) and Gaussian Mixture Model (GMM), the study further analyzes and subsequently visualizes the data. Results show that CIRA-CIC-DoHBrw-2020 simulates recent malware attacks and is highly imbalanced, which reflects realistic low-footprint cyber-attacks. The centrality graph clearly visualizes the IPs compromised by recent DoH attacks in real time, and the study concludes that smaller packet lengths, between 1000 and 2000 bytes, fit an attack trait.
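
To make the Eigen Centrality step concrete, the following is a minimal sketch of eigenvector centrality computed over a graph of source and destination IPs. It is not the authors' implementation: the abstract does not say how the graph was built, so treating each flow as an undirected edge is an assumption, and the flow list, IP addresses (taken from the reserved documentation ranges), and function name are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def eigenvector_centrality(edges, iterations=200, tol=1e-9):
    """Estimate eigenvector centrality by power iteration.

    `edges` is an iterable of (src_ip, dst_ip) pairs. Each pair is treated
    as an undirected edge, which is an assumption made for this sketch.
    """
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    nodes = sorted(neighbors)
    score = {node: 1.0 for node in nodes}

    for _ in range(iterations):
        # Multiply by (A + I) rather than A: the identity shift keeps the
        # same eigenvectors but avoids oscillation on bipartite graphs.
        new = {n: score[n] + sum(score[m] for m in neighbors[n]) for n in nodes}
        norm = sqrt(sum(v * v for v in new.values()))
        new = {n: v / norm for n, v in new.items()}
        delta = max(abs(new[n] - score[n]) for n in nodes)
        score = new
        if delta < tol:
            break
    return score


# Hypothetical flows: three clients query one DoH resolver, and one of the
# clients also talks to an additional host.
flows = [
    ("198.51.100.10", "192.0.2.53"),
    ("198.51.100.11", "192.0.2.53"),
    ("198.51.100.12", "192.0.2.53"),
    ("198.51.100.12", "203.0.113.7"),
]

scores = eigenvector_centrality(flows)
for ip, value in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{ip:15s} {value:.3f}")
```

In this toy graph the resolver that every client contacts receives the highest score. In a centrality plot, a node that stands out in the same way is a candidate for closer inspection.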

Full Text