Abstract

Encrypted traffic classification (ETC) requires discriminative and robust traffic representations learned from content-agnostic and imbalanced traffic data, which is challenging but indispensable for network security and network management. Some existing deep-learning-based ETC approaches have achieved promising results, but suffer two limitations in real-world network environments: 1) label bias caused by traffic class imbalance and 2) traffic homogeneity due to component sharing. How to leverage open-domain, unlabeled, imbalanced traffic data to learn representations with strong generalization ability remains a key challenge. In this paper, we propose a novel imbalanced traffic representation model, called Contrastive Encrypted Traffic Pre-training (CETP), which pre-trains deep multi-granularity traffic representations from imbalanced data without directly using application labels. The label bias induced by class imbalance can be further mitigated by semi-supervised continual fine-tuning of the pre-trained model, using pseudo-label iteration and a dynamic loss-weighting algorithm. CETP achieves state-of-the-art performance across four imbalanced encrypted traffic classification tasks, improving F1 to 96.31% (2.74%↑) on CP-Android, 93.86% (3.58%↑) on CIC-2019, and 84.16% (10.19%↑) on ISCX-VPN. We further validate the effectiveness of CETP on imbalanced QUIC-based encrypted traffic. Notably, analytical experiments verify that CETP not only effectively alleviates label bias and the misclassification of homogeneous flows, but also generalizes to ETC methods with diverse feature extractors.
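
The abstract only names the fine-tuning ingredients, so the following PyTorch sketch is an illustration rather than the paper's implementation: a generic contrastive (InfoNCE-style) objective over two views of the same flow, and a pseudo-label cross-entropy with inverse-frequency class weights as one simple form of dynamic loss weighting. All function and variable names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(view_a: torch.Tensor, view_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """Generic InfoNCE contrastive loss over a batch of flow embeddings.

    view_a, view_b: (N, D) embeddings of two views of the same N flows.
    Row i of view_a should match row i of view_b; all other rows in the
    batch act as negatives.
    """
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.t() / temperature          # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def weighted_pseudo_label_loss(logits: torch.Tensor,
                               pseudo_labels: torch.Tensor,
                               class_counts: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over pseudo-labeled flows with inverse-frequency
    class weights, so minority classes are not drowned out by majority
    ones as pseudo-label iterations proceed.
    """
    counts = class_counts.float().clamp(min=1.0)
    weights = counts.sum() / (counts.numel() * counts)  # mean weight ~ 1
    return F.cross_entropy(logits, pseudo_labels, weight=weights)
```

In a semi-supervised loop of the kind the abstract describes, the class counts would be recomputed from the current pseudo-labels at each iteration, so the weighting adapts as the label distribution shifts.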
