Traffic classification, the mapping of network traffic to applications, is important for a variety of networking and security tasks, such as network measurement, network monitoring, and the detection of malware activity. In this paper, we propose Securitas, a network trace-based protocol identification system that exploits the semantic information in protocol message formats. Securitas requires no prior knowledge of protocol specifications. Treating a protocol as a language between two processes, our approach builds on the new insight that the n-grams of protocol traces, just like those of natural languages, exhibit a highly skewed frequency-rank distribution that can be leveraged for protocol identification. In Securitas, we first extract statistical protocol message formats by clustering n-grams with the same semantics, and then use the resulting statistical formats to classify raw network traces. Securitas has the following key features: 1) it applies to both connection-oriented and connectionless protocols; 2) it handles both text and binary protocols; 3) it does not need to assemble IP packets into TCP or UDP flows; and 4) it is effective for both long-lived and short-lived flows. We implement Securitas and conduct extensive evaluations on real-world network traces containing both textual and binary protocols. Our experimental results on BitTorrent, CIFS/SMB, DNS, FTP, PPLIVE, SIP, and SMTP traces show that Securitas accurately identifies the network traces of the target application protocol, with an average recall of about 97.4% and an average precision of about 98.4%. These results indicate that Securitas is robust and delivers competitive performance in practice.
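The frequency-rank observation about protocol n-grams can be illustrated with a minimal sketch. It is not the Securitas implementation: the function names `byte_ngrams` and `frequency_rank`, the choice of n = 3, and the toy SMTP-like payloads are all assumptions made for illustration. The sketch only counts byte n-grams per packet payload and ranks them by frequency; the clustering and classification stages are not shown.

```python
from collections import Counter


def byte_ngrams(payload: bytes, n: int = 3):
    """Yield every contiguous n-byte window of a single packet payload."""
    for i in range(len(payload) - n + 1):
        yield payload[i:i + n]


def frequency_rank(payloads, n: int = 3):
    """Return (ngram, count) pairs sorted from most to least frequent.

    Each payload is processed on its own, so no flow reassembly is needed.
    """
    counts = Counter()
    for payload in payloads:
        counts.update(byte_ngrams(payload, n))
    return counts.most_common()


# Hypothetical toy payloads, made up for illustration (not from the paper's traces).
sample_payloads = [
    b"MAIL FROM:<a@example.com>\r\n",
    b"RCPT TO:<b@example.com>\r\n",
    b"MAIL FROM:<c@example.com>\r\n",
]

ranked = frequency_rank(sample_payloads, n=3)
for rank, (gram, count) in enumerate(ranked[:5], start=1):
    print(rank, gram, count)
```

On real traces, plotting count against rank on log-log axes would be one way to inspect how skewed the distribution is.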