Data preprocessing for distance-based unsupervised Intrusion Detection

D Said,P Federolf,K Barker,L Stirling

doi:10.1109/pst.2011.5971981

Abstract

Since Intrusion Detection Systems (IDSs) operate in real-time, they should be light-weighted to detect intrusions as fast as possible. Distance-based Outlier Detection (DBOD) is one of the most widely-used techniques for detecting outliers due to its simplicity and efficiency. Additionally, DBOD is an unsupervised approach which overcomes the problem of the lack of training datasets with known intrusions. However, since IDSs usually have high-dimensional datasets, using DBOD becomes subject to the curse of the dimensionality problem. Furthermore, intrusion datasets should be normalized before calculating pair-wise distance between observations. The purpose of this research is conduct a comparative study among different normalization methods in conjunction with a well-known feature extraction technique; Principle Component Analysis (PCA). Therefore, the efficiency of these methods as data preprocessing techniques can be investigated when applying DBOD to detect intrusions. Experiments were performed using two kinds of distance metrics; Euclidean distance and Mahalanobis distance. We further examined the PCA using 7 threshold values to indicate the number of Principle components to consider according to their total contribution in the variability of features. These approaches have been evaluated using the KDD Cup 1999 intrusion detection (KDD-Cup) dataset. The main purpose of this study is to find the best attribute normalization method along with the correct threshold value for PCA so that a fast unsupervised IDS can discover intrusions effectively. The results recommended using the Log normalization method combined the Euclidean distance while performing PCA.

Full Text