Abstract

Because the traffic of novel attacks exceeds current knowledge, realistic traffic-labeling methods are prone to mislabeling, which significantly degrades machine learning-based intrusion detection systems. Data cleaning typically relies on the ability of supervised deep neural networks to learn correct knowledge; under high noise, however, noisy labels can corrupt a supervised network and render it ineffective. To clean traffic datasets under high noise conditions, we propose an unsupervised learning-based data cleaning framework (called ULDC) that relies on neither labels nor powerful supervised networks, thereby reducing the impact of noisy labels. ULDC evaluates the confidence of observed labels through the distribution and similarity of samples in a low-dimensional space. Moreover, ULDC maximizes the retention of hard samples through adaptive intra-class threshold evaluation, preserving more hard samples for training and improving generalization. In evaluations on the CIRA-CIC-DoHBrw-2020 dataset, ULDC corrected more than 75% of the data under high noise, outperforming state-of-the-art methods. ULDC is applicable to traffic data cleaning in both traditional networks and novel networks such as the Internet of Things and mobile networks, and it has been validated on datasets including CIC-IDS-2017 and IoT-23.

