Abstract

Clustering has been proven to be an effective technique for finding data instances with similar characteristics. Clustering algorithms are based on the notion of distance between data points, often computed using the Euclidean metric. For this reason, they are mostly applicable to data sets composed of numerical values. However, real-life data often contain features that are categorical in nature. For example, to identify abnormal behavior or a cyberattack in a network, we usually examine packet headers, which contain categorical values such as source and destination IP addresses, source and destination port numbers, upper-layer protocols, etc. The Euclidean metric is not applicable to such data sets because it cannot compute the distance between categorical values. To address this problem, similarity functions have been designed to determine the relationship between given categorical values. Similarity defines how closely related objects are to one another. It can often be thought of as the opposite of distance: similar objects receive a high value, while dissimilar objects receive a low or zero value. In this paper we explored the accuracy of various similarity functions using the Partitioning Around Medoids (PAM) clustering algorithm. We tested the similarity functions on several data sets to determine their ability to correctly predict the class labels. We also examined the applicability of the similarity functions to different types of data sets.
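
To illustrate the idea of a similarity function on categorical data, the following minimal Python sketch uses the simple overlap (matching) measure, which is only one possible choice and is not necessarily among the functions evaluated in the paper. The packet-header records, the function names, and the `medoid_index` helper are all hypothetical; the helper shows only the medoid-selection step that PAM repeats, not the full algorithm.

```python
from typing import Sequence


def overlap_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Fraction of attributes on which two records agree (1.0 = identical)."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    if not a:
        raise ValueError("records must have at least one attribute")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def overlap_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Treat distance as the opposite of similarity."""
    return 1.0 - overlap_similarity(a, b)


def medoid_index(records: Sequence[Sequence[str]]) -> int:
    """Index of the record with the smallest total distance to all others.

    Ties are resolved in favor of the earliest record.
    """
    if not records:
        raise ValueError("need at least one record")
    totals = [sum(overlap_distance(r, other) for other in records) for r in records]
    return min(range(len(records)), key=totals.__getitem__)


# Hypothetical packet headers: (source IP, destination IP, destination port, protocol)
p1 = ("10.0.0.1", "10.0.0.9", "443", "TCP")
p2 = ("10.0.0.1", "10.0.0.9", "80", "TCP")
p3 = ("192.168.1.5", "10.0.0.7", "53", "UDP")

print(overlap_similarity(p1, p2))  # 0.75: three of four attributes match
print(overlap_similarity(p1, p3))  # 0.0: no attribute matches
```

Because PAM needs only pairwise dissimilarities and always picks an actual record as the cluster center, any similarity function of this kind can be plugged in without defining a mean over categorical values.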

