Abstract

Web robots are automated software agents used primarily for web searching and indexing. Increasingly, they are also used for malicious activities on the internet, such as spamming and spying, because they can camouflage their behavior. Since these automated traversers conceal their identity, the HTTP requests they generate are difficult to identify in web server logs. Unsupervised clustering methods may be useful for separating HTTP user sessions into web robot sessions and human sessions. In this paper, three clustering algorithms are used to cluster the session data: Clustering Large Applications (CLARA), Ordering Points To Identify the Clustering Structure (OPTICS), and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH). The algorithms are drawn from three different categories: partition-based, density-based, and hierarchy-based, respectively. They are run using their implementations in the open-source ELKI and JBIRCH libraries and applied to publicly available user session data. Clustering performance is compared using the cluster validity measures Rand Index, Jaccard Index, and F-measure. The time each algorithm takes to cluster web robot sessions and distinguish them from the other three classes is also measured.
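
As an illustration of the validity measures named above, the following minimal Python sketch computes the Rand Index, Jaccard Index, and F-measure in their common pair-counting form. This is an assumption about how the measures are defined, not a reproduction of the paper's ELKI-based evaluation; the function names and the toy session labels are hypothetical.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Count session pairs by agreement between true classes and clusters."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_class = truth[i] == truth[j]
        same_cluster = pred[i] == pred[j]
        if same_class and same_cluster:
            tp += 1  # pair correctly placed together
        elif same_cluster:
            fp += 1  # pair clustered together but from different classes
        elif same_class:
            fn += 1  # pair from the same class but split across clusters
        else:
            tn += 1  # pair correctly kept apart
    return tp, fp, fn, tn


def validity_measures(truth, pred):
    """Return pair-counting Rand Index, Jaccard Index, and F-measure."""
    tp, fp, fn, tn = pair_counts(truth, pred)
    total = tp + fp + fn + tn
    rand = (tp + tn) / total if total else 1.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {"rand": rand, "jaccard": jaccard, "f_measure": f_measure}


# Hypothetical toy data: true session classes vs. assigned cluster ids.
truth = ["robot", "robot", "human", "human"]
pred = [0, 0, 0, 1]
scores = validity_measures(truth, pred)
```

All three measures lie between 0 and 1, with 1 indicating that the clustering reproduces the labeled classes exactly. The pairwise loop is quadratic in the number of sessions, so it suits small illustrative inputs rather than full server logs.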
