Effective Anomaly Detection Model Training with only Unlabeled Data by Weakly Supervised Learning Techniques

Wenzhuo Yang, Kwok-Yan Lam

doi:10.1007/978-3-030-86890-1_23

Abstract

Intrusion detection systems (IDS) play an important role in security monitoring by identifying anomalous or suspicious activities. Traditional IDS are either signature-based (rule-based) or anomaly-based (analytics-based). With the objective of detecting zero-day attacks, analytics-based IDS have attracted great interest from the cybersecurity community. Furthermore, machine learning (ML) techniques have been extensively explored for advancing analytics-based IDS. Many ML techniques have been studied to improve the efficiency of intrusion detection, and some have shown good performance. However, traditional supervised learning algorithms need strong supervision information, that is, fully correctly labeled (FCL) data, to train an accurate model. Meanwhile, with the rapid development of network and communication technologies, the volume of network traffic and system logs has increased drastically in recent years, especially with the introduction of Next Generation Broadband Network (NGBN) and 5G networks. This puts heavy pressure on analytics-based IDS: training predictive ML models requires security-relevant data to be labeled manually, which creates practical barriers to achieving effective IDS. To avoid over-dependence on strong supervision information, cybersecurity researchers have studied weakly supervised learning techniques, which utilize incomplete, inexact, or possibly inaccurate labels, because such weak supervision information is easier and cheaper to obtain than FCL data. This research explores the feasibility of weakly supervised learning techniques in IDS tasks so as to reduce the reliance on a massive amount of strong supervision information, the demand for which will only continue to grow in the big data society. We also investigated the detection stability of the proposed scheme when inaccurate weak supervision information is provided.
In this article, we propose an IDS model training scheme based on a weakly supervised learning algorithm that requires only unlabeled data. Experiments were performed on three publicly available IDS evaluation datasets. The results show that the proposed scheme performs well, even outperforming some supervised learning-based IDS (SL-IDS) models. The experimental results also indicate that the weakly supervised learning-based IDS model is robust and can be applied in real-world situations. In addition, we examined the detection performance of the proposed method on class-imbalanced training data, and the experimental results show that it performs better than the compared methods.

Full Text