Categorization Performance of Unsupervised Learning Techniques for Web Robots Sessions

Dilip Singh Sisodia, Radhika Khandelwal, Arti Anuragi

doi:10.1109/icirca.2018.8597200

Abstract

Web robots are automated software agents used primarily for web searching and indexing. Because they can camouflage their behavior, web robots are now also frequently used for malicious activities on the internet, such as spamming and spying. The HTTP requests generated by these automated traversers are difficult to identify in web server logs because the robots conceal their identity. Unsupervised clustering methods may be useful for categorizing HTTP user sessions into web robot and human sessions. In this paper, three clustering algorithms are used to cluster the session data: Clustering Large Applications (CLARA), Ordering Points To Identify the Clustering Structure (OPTICS), and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH). They are drawn from three different categories: partition-based, density-based, and hierarchy-based, respectively. The algorithms are run using the ELKI and JBIRCH open-source libraries and applied to publicly available user session data. Their clustering performance is compared using cluster validity measures, including the Rand index, Jaccard index, and F-measure. The effective time taken by each method to cluster web robot sessions and distinguish them from the other three classes is also measured.
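The three validity measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the common pair-counting definitions, in which every pair of sessions is checked for whether it shares a class label and whether it shares a cluster; the abstract does not state which variant the authors used. The function names and the toy labels are hypothetical and are not taken from the paper or from ELKI.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count session pairs by agreement between class labels and cluster labels."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1  # pair correctly placed together
        elif same_cluster:
            fp += 1  # pair placed together but from different classes
        elif same_class:
            fn += 1  # pair from the same class but split apart
        else:
            tn += 1  # pair correctly kept apart
    return tp, fp, fn, tn


def validity_measures(truth, pred):
    """Return pair-counting Rand index, Jaccard index, and F-measure."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    total = tp + fp + fn + tn
    rand = (tp + tn) / total if total else 0.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"rand": rand, "jaccard": jaccard, "f_measure": f_measure}


# Toy example (made-up labels): four sessions, one human session misclustered.
truth = ["robot", "robot", "human", "human"]
pred = [0, 0, 1, 0]
scores = validity_measures(truth, pred)
print(scores)
```

All three measures lie between 0 and 1, with 1 indicating perfect agreement between the clustering and the class labels. The Jaccard index ignores pairs that are correctly kept apart, so it is typically lower than the Rand index on the same clustering.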

Full Text