Abstract

Intrusion detection is an essential task for protecting the cyber environment from attacks. Many studies have proposed sophisticated models to detect intrusions in large amounts of data, yet they ignore the fact that poor data quality directly degrades the performance of intrusion detection systems. This article first summarizes existing machine learning-based intrusion detection systems and the datasets used to build them. It then discusses the data preparation workflow and data quality requirements for intrusion detection. To investigate how data quality affects model performance, we conducted experiments on 11 HIDS datasets using eight machine learning (ML) models and two pre-trained language models (BERT and GPT-2). The experimental results show that: 1. BERT and GPT-2 outperform the other models on all of the datasets. 2. The pre-trained models and the classic ML models behave differently when duplicate and overlapped data are removed from a dataset; the pre-trained models are more capable of learning from duplicate and overlapped data than the classic ML models. 3. Removing overlaps and duplicates improves the performance of both the pre-trained models and the traditional ML models on most datasets used in this study, although it can sometimes decrease performance. 4. The reliability of model performance is compromised when the testing data contain duplicates. 5. The overlap rate between the normal and intrusion classes appears to be inversely related to the pre-trained models' performance on the intrusion detection task. Given these results, we discuss model selection in HIDS, and quality assurance of training and testing data, based on nine data quality dimensions.
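The deduplication and overlap-removal steps referred to above can be sketched as follows. This is a minimal illustration only, assuming a hypothetical layout of labeled `(trace, label)` pairs; the actual feature representation and cleaning pipeline depend on the dataset and are detailed in the paper itself.

```python
# Minimal sketch (hypothetical data layout): remove exact duplicates from a
# labeled dataset, then drop "overlapped" samples whose trace appears under
# both the normal and the intrusion label.

def clean(samples):
    """samples: list of (trace, label) pairs, label in {'normal', 'intrusion'}."""
    # 1. Remove exact duplicate (trace, label) pairs, keeping the first occurrence.
    seen = set()
    deduped = [s for s in samples if not (s in seen or seen.add(s))]

    # 2. Remove overlaps: traces that occur in both classes.
    by_label = {'normal': set(), 'intrusion': set()}
    for trace, label in deduped:
        by_label[label].add(trace)
    overlap = by_label['normal'] & by_label['intrusion']
    return [(t, l) for t, l in deduped if t not in overlap]

data = [('read open read', 'normal'),
        ('read open read', 'normal'),      # exact duplicate
        ('mmap exec write', 'intrusion'),
        ('read open read', 'intrusion')]   # overlaps with a normal sample
print(clean(data))  # → [('mmap exec write', 'intrusion')]
```

Whether to apply such cleaning to the training set, the testing set, or both is precisely the question the experiments above address: it helps most models on most datasets, but not universally.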
