Abstract

Protocol Reverse Engineering (PRE), a means of detecting and analysing unknown or proprietary protocols, has developed rapidly in recent years. Within this field, format-oriented message clustering is a fundamental step for telling apart the messages of an unknown protocol. This paper addresses format-oriented message clustering for unknown protocols, including messages captured in proprietary or non-cooperative network environments whose specifications are unavailable. Drawing on the basic rules of ABNF, we define Token Format Distance (TFD) and Message Format Distance (MFD) to represent the format similarity of tokens and of messages, and we compute them with Jaccard Distance and an optimized sequence alignment algorithm (MFD measurement), respectively. A distance matrix built from MFD is then fed to the DBSCAN algorithm, which clusters the unknown protocol messages into classes with different formats. For this step we design an unsupervised clustering strategy in which the Silhouette Coefficient and the Dunn Index guide the selection of DBSCAN parameters. In experiments on two datasets, the V-measure (the harmonic mean of homogeneity and completeness) of the resulting clusters is above 0.91 on both, and the Fowlkes-Mallows index (FMI) and coverage are no lower than 0.97. In boxplot analyses, the interquartile ranges (IQR) of V-measure and FMI are below 0.1 and 0.03, respectively, which indicates that the method is both valid and stable. Comprehensive analyses and comparisons on these indexes also show considerable advantages of our method over previous work.
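
The clustering pipeline summarised above (pairwise distances, then a distance matrix, then DBSCAN) can be illustrated with a minimal Python sketch. It is not the method of the paper: a plain Jaccard distance over whitespace-separated tokens stands in for TFD and MFD, the sample messages are invented, and `eps` and `min_pts` are fixed by hand rather than selected with the Silhouette Coefficient and Dunn Index. All function names are hypothetical.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def distance_matrix(messages):
    """Pairwise distance matrix. Whitespace tokenisation is an assumption
    made for this sketch; it stands in for the MFD of the abstract."""
    token_sets = [set(m.split()) for m in messages]
    return [[jaccard_distance(x, y) for y in token_sets] for x in token_sets]


def dbscan(dist, eps, min_pts):
    """Minimal DBSCAN on a precomputed distance matrix.
    Returns one label per point; -1 marks noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbours = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            j_neighbours = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, keep expanding
    return labels


# Invented sample messages with two obvious formats.
messages = [
    "GET /index HTTP/1.1",
    "GET /about HTTP/1.1",
    "GET /home HTTP/1.1",
    "220 mail ready",
    "220 smtp ready",
    "220 host ready",
]
dist = distance_matrix(messages)
labels = dbscan(dist, eps=0.6, min_pts=2)
print(labels)
```

In this toy input, messages of the same format share two of their three tokens (distance 0.5) and messages of different formats share none (distance 1.0), so any `eps` between the two values separates the groups. Real traffic would need the format-aware distances and the unsupervised parameter selection that the abstract describes.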
