Abstract

With the rapid development of the Internet, a large number of private protocols emerge on the network. However, some of them are constructed by attackers to avoid being analyzed, posing a threat to computer network security. The blockchain uses the P2P protocol to implement various functions across the network. Furthermore, the P2P protocol format of blockchain may differ from the standard format specification, which leads to sniffing tools such as Wireshark and Fiddler not being able to recognize them. Therefore, the ability to distinguish different types of unknown network protocols is vital for network security. In this paper, we propose an unsupervised clustering algorithm based on maximum frequent sequences for binary protocols, which can distinguish various unknown protocols to provide support for analyzing unknown protocol formats. We mine the maximum frequent sequences of protocol message sets in bytes. And we calculate the fuzzy membership of the protocol message to each maximum frequent sequence, which is based on fuzzy set theory. Then we construct the fuzzy membership vector for each protocol message. Finally, we adopt K-means++ to split different types of protocol messages into several clusters and evaluate the performance by calculating homogeneity, integrity, and Fowlkes and Mallows Index (FMI). Besides, the clustering algorithms based on Needleman–Wunsch and the fixed-length prefix are compared with the algorithm presented in this paper. Compared with these traditional clustering methods, we demonstrate a certain improvement in the clustering performance of our work.

Highlights

  • Network protocol stipulates the format and sequence of the messages exchanged between the entities at the two ends of the communication in the computer network, and the entities receive the protocol messages and make appropriate actions

  • We propose an unsupervised protocol message clustering method based on the fuzzy membership of maximum frequent sequences, which effectively solves the problem of clustering difficulties caused by different protocol message lengths

  • Since a cluster represents a type of unknown protocol, each cluster is supposed to contain all of the corresponding protocol messages. To evaluate these factors quantitatively, we introduce the three commonly used indexes: homogeneity is used to measure the closeness of each cluster contains only one type of unknown protocol messages, completeness indicates how much the protocol messages of the same class are assigned to the same cluster, and Fowlkes and Mallows Index (FMI) is the overall evaluation of clustering performance

Read more

Summary

Introduction

Network protocol stipulates the format and sequence of the messages exchanged between the entities at the two ends of the communication in the computer network, and the entities receive the protocol messages and make appropriate actions. The workload of manually analyzing the format specifications of the unknown protocol is large, time-consuming, and error-prone. It takes 12 years for the SAMBA project to basically realize the extraction of the basic protocol specifications of the SMB protocol [10,11]. We propose an unsupervised protocol message clustering method based on the fuzzy membership of maximum frequent sequences, which effectively solves the problem of clustering difficulties caused by different protocol message lengths. 3. We introduce the number of protocol types as prior knowledge so that we can cluster the protocol messages, adjust different minimum support thresholds and calculate homogeneity, integrity and FMI to evaluate our method’s performance.

Related Work
Problem Statement
Unsupervised Clustering Based on Maximum Frequent Sequences
Maximum Frequent Sequence Mining
Fuzzy Membership Vector Construction and Protocol Message Clustering
Experiment and Analysis
Conclusion and Future Work
Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call