Encrypted network traffic classification based on machine learning

Reham T Elmaghraby, Nada M Abdel Aziem, Mohammed A Sobh, Ayman M Bahaa-Eldin

doi:10.1016/j.asej.2023.102361

Abstract

Traffic encryption is an essential part of maintaining the security and privacy of data transmission. It plays an important role in keeping networks secure by preventing attackers from intercepting confidential information and accessing it without authorization. However, its effectiveness relies heavily on accurate classification techniques being applied correctly, so that legitimate user activity can be distinguished from attempted malicious activity within the network's boundaries. Encrypted network traffic is becoming increasingly common in modern communication systems, presenting a challenge for effective network management and security. To address this challenge, machine learning models have been employed to classify encrypted traffic, but with limited success because packet contents are not visible and cannot be inspected. To tackle this issue, research has begun on developing more effective machine learning models that classify encrypted payloads without inspecting their contents directly. This research investigates how features such as packet length, timestamps, Transport Layer Security (TLS) information, and encrypted payload information can be used as inputs for classification, instead of analyzing the unencrypted packet content, which is impossible under current technology constraints. The evaluation focuses on assessing different model architectures, as well as feature selection techniques that yield improved results over existing approaches. In this paper, we propose three approaches to identify encrypted traffic and classify different applications such as browsing, VoIP, file transfer, and video streaming.
The first two techniques consist of two stages: the first stage is either a neural network or a bi-directional LSTM, and the second stage is a selection of different classification techniques, namely Random Forest, Support Vector Machine, Linear Regression, and K-Nearest Neighbor. The final result is obtained using an ensemble voting technique. In the third technique, the network packets are grouped by source IP, destination IP, and session time before being fed into three different LSTM network configurations: coupled with 1D convolutional layers, coupled with 2D convolutional layers, or with no convolutional layers. As with the first two techniques, the final result is obtained by ensemble voting. In an extensive comparison of the three approaches, the first approach yielded the highest accuracy, whereas the second and third techniques were superior in terms of time complexity. The achieved accuracies were 96.8%, 95.2%, and 96.5% for the three proposed techniques, respectively.
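As a rough illustration of two steps named in the abstract, grouping packets by source IP, destination IP, and session time, and combining classifier outputs by voting, the following is a minimal sketch and not the authors' implementation. The `Packet` type, the fixed-length session window (`session_seconds`), the feature names, and the hard (majority) voting rule are all assumptions made for the example; the abstract does not specify how sessions are delimited or which voting scheme is used.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet record holding only content-agnostic metadata."""
    src_ip: str
    dst_ip: str
    timestamp: float  # seconds
    length: int       # bytes


def group_into_sessions(packets, session_seconds=60.0):
    """Group packets by (source IP, destination IP, time window).

    Assumption: a session is a fixed-length time window of `session_seconds`.
    """
    sessions = defaultdict(list)
    for p in sorted(packets, key=lambda pkt: pkt.timestamp):
        window = int(p.timestamp // session_seconds)
        sessions[(p.src_ip, p.dst_ip, window)].append(p)
    return dict(sessions)


def session_features(session):
    """Summarize a session using packet lengths and inter-arrival times."""
    lengths = [p.length for p in session]
    times = [p.timestamp for p in session]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "n_packets": len(session),
        "mean_length": sum(lengths) / len(lengths),
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


def majority_vote(predictions):
    """Hard voting: return the most frequent label among the predictions.

    Ties are resolved in favor of the label that appears first.
    """
    return Counter(predictions).most_common(1)[0][0]
```

In use, each session's feature summary (or its raw packet sequence) would be passed to several trained classifiers, and `majority_vote` would be applied to their predicted labels, for example `majority_vote(["voip", "browsing", "voip"])`.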

Full Text