Abstract

In parallel with the rapid adoption of transport layer security (TLS), malware has used the encrypted communication channel provided by TLS to hinder detection based on network traffic. In response, recent research efforts have been directed toward malware detection and malware family classification for TLS-encrypted traffic. However, among the proposed feature sets, those that utilize the sequential information of each TLS session have not been properly evaluated, especially in the context of malware family classification. We therefore propose a systematic framework to evaluate the state-of-the-art malware family classification methods for TLS-encrypted traffic in a controlled environment, and we discuss the advantages and limitations of the methods comprehensively. In particular, our experimental results for the 10 representation and classifier combinations show that the graph-based representation of the sequential information achieves better performance regardless of the evaluated classification algorithm. With our framework and findings, researchers can design better machine-learning-based classifiers.

Highlights

  • Shen et al. [11] proposed the notion of the traffic interaction graph (TIG) to represent a packet length sequence with directions and introduced graph neural network-based representation learning for decentralized application classification, called GraphDApp

  • While a majority of recent malware detection and malware family classification methods utilize a subset of the enhanced flow features, which can be exported by network devices [20], collecting such features may be inefficient in some scenarios, especially when there is no careful feature selection (e.g., [34])

  • For the classification accuracy ranking, most of the results can be expected from existing research efforts [8,9,11], although those efforts mainly focus on application classification and malware detection

Summary

Introduction

Although the secure sockets layer (SSL), an encryption protocol designed for web applications, has been in use since the broad adoption of the internet in the 1990s, SSL and its successor, transport layer security (TLS), accounted for less than half of web traffic until the mid-2010s [1]. While feature representation and learning for the classifiers are important issues in machine learning applications [12], existing research efforts in malware family classification rarely report performance comparisons among different feature representations and learning approaches. To this end, in this article, we propose a systematic framework to evaluate malware family classification methods for TLS-encrypted traffic in a controlled environment. To evaluate the existing research efforts with different feature representations and learning approaches fairly in a common environment, we utilize the framework to extract a common flow-level feature (i.e., flow length sequence and directions) from TLS-encrypted traffic and evaluate several malware family classification methods.
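
As a rough illustration of what a length sequence with directions might look like as classifier input, the sketch below encodes each packet of a flow as a signed integer and pads the result to a fixed size. The sign convention (positive for client-to-server), the zero padding, the `max_len` parameter, and the function name are all assumptions made for this example; the summary does not specify how the framework encodes the feature.

```python
from typing import List, Tuple


def signed_length_sequence(
    packets: List[Tuple[str, int]], max_len: int = 8
) -> List[int]:
    """Encode a flow as a fixed-size sequence of signed packet lengths.

    packets: (direction, length) pairs in arrival order, where direction
    is "out" (client to server) or "in" (server to client).
    Hypothetical convention: outbound lengths are positive, inbound
    lengths are negative, and short flows are right-padded with zeros.
    """
    signed = [
        length if direction == "out" else -length
        for direction, length in packets
    ]
    signed = signed[:max_len]  # truncate long flows
    return signed + [0] * (max_len - len(signed))  # pad short flows


# Made-up example flow: lengths in bytes, not taken from any dataset.
flow = [("out", 517), ("in", 1460), ("in", 1200), ("out", 126)]
features = signed_length_sequence(flow, max_len=6)
print(features)
```

A vector of this kind could be passed directly to a conventional classifier, whereas a graph-based representation would build additional structure on top of the same sequence.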

Backgrounds and Related Work
Early Encrypted Traffic Classification Methods
Exploiting Sequential Information of TLS Flow
Fine-Grained Classification for TLS-Encrypted Traffic in Mobile Apps
Malware Detection and Family Classification from TLS-Encrypted Traffic
Lack of Malware Family Dataset
Need for Evaluation Based on Packet Length Sequences
Framework Overview
Feature Representations
Classification Algorithms
Traffic Dataset
Accuracy and F1 Score of the State-of-the-Art Methods
Confusion Matrices without Noisy Labels
ROC Curves and AUC Values
Non-Parametric Friedman Test and Post-Hoc Nemenyi Test
Training Time and Testing Time
Performance Evaluation with Noisy Labels
Discussion
Conclusions