Feature Extraction for Payload Classification: A Byte Pair Encoding Algorithm

Tianci Xu, Peng Zhou

doi:10.1109/iccc56324.2022.10065977

Abstract

Payload classification is a form of deep packet inspection that has proved effective for many Internet applications, including but not limited to intrusion detection and network diagnostics. In typical payload classification, feature extraction is the first and a very important step, and it strongly affects the quality and quantity of classification results. At present, most payload feature extraction adopts the n-gram model. However, the n-gram model generates features of fixed length (n), which may cause information loss during feature extraction. In this paper, we propose a very different approach: a Byte Pair Encoding (BPE) algorithm for payload feature extraction. The algorithm introduces a novel concept of sub-words to express payload features, so that the feature length is no longer fixed. With BPE, we first initialize a vocabulary of single bytes and then repeatedly update it by merging the most frequent byte pair in the payloads into a new sub-word, until all sub-word pairs reach (approximately) the same frequency, regardless of the lengths of these sub-words. The result is a flexible and scalable vocabulary for feature extraction and payload embedding. Finally, we conduct payload classification experiments on the CIC-IDS2017 dataset to verify the effectiveness of our algorithm. The results confirm that our BPE algorithm achieves better classification performance than traditional n-gram methods.
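The merge procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`ngrams`, `learn_bpe`, `encode`), the toy payloads, and the stopping rule are assumptions made for illustration. In particular, the abstract stops when all sub-word pairs reach approximately the same frequency, whereas this sketch substitutes a simpler stand-in, stopping when the most frequent pair occurs fewer than `min_count` times or after `max_merges` merges. A fixed-length n-gram extractor is included only for contrast.

```python
from collections import Counter


def ngrams(payload: bytes, n: int) -> list[bytes]:
    """Fixed-length baseline: every feature has exactly n bytes."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]


def pair_counts(sequences):
    """Count adjacent sub-word pairs over all tokenized payloads."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one sub-word."""
    merged = pair[0] + pair[1]
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == pair[0] and seq[i + 1] == pair[1]:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(payloads, max_merges=50, min_count=2):
    """Learn a sub-word vocabulary from raw payloads.

    Assumption: `max_merges` and `min_count` are a simplified stopping
    rule, not the frequency-equalization criterion named in the abstract.
    """
    # Start from a vocabulary of single bytes.
    sequences = [[bytes([b]) for b in p] for p in payloads]
    vocab = {tok for seq in sequences for tok in seq}
    merges = []
    for _ in range(max_merges):
        counts = pair_counts(sequences)
        if not counts:
            break
        # Most frequent pair; ties are broken by pair value for determinism.
        pair, freq = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_count:
            break
        sequences = [merge_pair(seq, pair) for seq in sequences]
        merges.append(pair)
        vocab.add(pair[0] + pair[1])
    return vocab, merges, sequences


def encode(payload: bytes, merges):
    """Tokenize an unseen payload by replaying the learned merges in order."""
    seq = [bytes([b]) for b in payload]
    for pair in merges:
        seq = merge_pair(seq, pair)
    return seq


# Toy payloads (hypothetical, not taken from CIC-IDS2017).
payloads = [b"GET /a", b"GET /b", b"GET /c"]
vocab, merges, sequences = learn_bpe(payloads)
tokens = encode(b"GET /z", merges)
```

On these toy payloads the shared prefix `GET /` ends up as a single variable-length sub-word, while `ngrams(payload, 2)` would split the same prefix into overlapping two-byte pieces. The resulting token sequences could then be mapped to counts or embeddings for a classifier, which is outside the scope of this sketch.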

Full Text