An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

Jakrarin Therdphapiyanak, Krerk Piromsopa

doi:10.1109/ecticon.2013.6559650

Abstract

In this paper, we determine the appropriate number of clusters and the proper number of entries for applying K-means clustering to a TCPdump data set using the Apache Mahout/Hadoop framework. We aim to find a suitable configuration for efficiently analyzing a large data set in a limited amount of time. Our implementation applies Hadoop to large-scale log analysis, with the data set from the KDD'99 competition as test data. With the distributed system framework, we can analyze the whole KDD'99 data set after first applying our preprocessing. In addition, we use an anomaly detection model for log analysis. A key challenge is to make anomaly detection more accurate. For the K-means algorithm, a key challenge is setting the appropriate number of initial clusters (K). Moreover, we discuss whether the number of entries in the log files affects the accuracy and detection rate of the system. Our implementation and experimental results therefore indicate the appropriate number of clusters and the proper number of entries in log files. Finally, we present our experimental results as a graph of accuracy rate against the number of initial clusters (K), an ROC curve, and a table of detection rate and false alarm rate.
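As an illustration of the metrics named in the abstract, the following minimal sketch shows one common way to score a clustering against labeled records. It is not the authors' Mahout/Hadoop implementation. It assumes that each cluster is labeled by the majority ground-truth class of its members, that detection rate is TP / (TP + FN), and that false alarm rate is FP / (FP + TN). The function names `label_clusters` and `evaluate` and the label strings are hypothetical.

```python
from collections import Counter


def label_clusters(assignments, truths):
    """Map each cluster id to the majority ground-truth label of its members.

    Assumption: majority voting is used to name clusters; the paper's own
    labeling rule may differ.
    """
    votes = {}
    for cluster_id, truth in zip(assignments, truths):
        votes.setdefault(cluster_id, Counter())[truth] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}


def evaluate(assignments, truths, attack_label="attack"):
    """Return (accuracy, detection_rate, false_alarm_rate) for one clustering."""
    mapping = label_clusters(assignments, truths)
    tp = fp = tn = fn = 0
    for cluster_id, truth in zip(assignments, truths):
        predicted_attack = mapping[cluster_id] == attack_label
        actual_attack = truth == attack_label
        if predicted_attack and actual_attack:
            tp += 1
        elif predicted_attack and not actual_attack:
            fp += 1
        elif not predicted_attack and actual_attack:
            fn += 1
        else:
            tn += 1
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, detection_rate, false_alarm_rate
```

Under these assumptions, calling `evaluate` once per candidate K (with the cluster assignments produced for that K) gives the points for an accuracy-versus-K graph and the rows of a detection rate and false alarm rate table.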

Full Text