A Statistical and Cluster Analysis Exploratory Study of Snort Rules

Claude Turner, Anthony Joseph

doi:10.1016/j.procs.2017.09.023

Open Access

https://doi.org/10.1016/j.procs.2017.09.023

Abstract

While many studies applying machine learning algorithms to signature-based intrusion detection systems (IDSs) have focused on log analysis, another avenue that could yield additional insight into the intrusion detection problem is analysis of the rules that lie at the heart of an IDS. A signature-based IDS is generally ineffective against zero-day attacks; however, an analysis of the types of attack signatures encountered by such a system over its history could provide guidance about future attacks. This history is essentially encapsulated in the rules of the IDS itself. An examination of rules can reveal a variety of useful information about the kinds of traffic that a network considers to be malicious. This research performs a statistical and machine learning cluster analysis of Snort rules, with a focus on the network protocols used by the rules. It proceeds in two phases. In Phase 1, algorithms are developed to extract protocol information from Snort rules and to determine their distribution across rule sets. This component of the research shows that there are three major types of protocols used in Snort rules: Transmission Control Protocol (TCP), User Datagram Protocol (UDP), and Internet Control Message Protocol (ICMP). It also provides the frequency (or cardinality) of each such protocol per rule set. Phase 2, which focuses on the default enabled rules of the latest Snort rule version, performs cluster analyses using three approaches: the k-means algorithm, a hierarchical agglomerative clustering algorithm, and a density-based clustering algorithm. This component of the research illustrates that, with respect to the number of protocols per rule set, TCP use is dominant and that it is reasonable to divide Snort rule sets into three principal clusters. One cluster consists of a single rule set and is characterized by a preponderance of UDP usage among its rules. A second cluster consists of four rule sets and is distinguished by each rule set containing more than 900 TCP-based rules. The last cluster consists of all other rule sets, each of which contains fewer than 500 TCP-based rules. A comparison of the three clustering algorithms using the silhouette metric shows them to be very effective, with negligible variation in performance.
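The Phase 1 idea of extracting protocol information from Snort rules can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify; it only assumes the standard Snort rule header layout, in which the action keyword is followed by the protocol. The names `RULE_HEADER`, `count_protocols`, and `sample_rules`, as well as the sample rules themselves, are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Assumed Snort rule header layout: [# ]<action> <protocol> <src> <port> -> <dst> <port> (options)
# A leading "#" is treated here as marking a disabled (commented-out) rule.
RULE_HEADER = re.compile(
    r"^\s*(?P<disabled>#\s*)?"
    r"(?P<action>alert|log|pass|activate|dynamic|drop|reject|sdrop)\s+"
    r"(?P<proto>\w+)\s",
    re.IGNORECASE,
)


def count_protocols(rule_lines, enabled_only=True):
    """Count rules per protocol in an iterable of rule-file lines.

    Lines that do not look like a rule header (blank lines, plain
    comments) are skipped. With enabled_only=True, commented-out
    rules are skipped as well.
    """
    counts = Counter()
    for line in rule_lines:
        match = RULE_HEADER.match(line)
        if match is None:
            continue  # not a rule header
        if enabled_only and match.group("disabled"):
            continue  # disabled rule
        counts[match.group("proto").lower()] += 1
    return counts


# Hypothetical rules, made up for this example only.
sample_rules = [
    'alert tcp $EXTERNAL_NET any -> $HOME_NET 80 (msg:"example web rule"; sid:1000001; rev:1;)',
    'alert udp $EXTERNAL_NET any -> $HOME_NET 53 (msg:"example dns rule"; sid:1000002; rev:1;)',
    'alert icmp $EXTERNAL_NET any -> $HOME_NET any (msg:"example ping rule"; sid:1000003; rev:1;)',
    '# alert tcp $EXTERNAL_NET any -> $HOME_NET 21 (msg:"disabled example"; sid:1000004; rev:1;)',
    "# This is a plain comment, not a rule",
]

if __name__ == "__main__":
    print(count_protocols(sample_rules))
    print(count_protocols(sample_rules, enabled_only=False))
```

Running the counter once per rule file would give a per-rule-set vector of protocol frequencies, the kind of feature the Phase 2 clustering is described as operating on. The regular expression is deliberately simple: a prose comment that happens to begin with an action keyword would be miscounted, so a production parser would need stricter header validation.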