Abstract

Web robots are software applications that run automated tasks over the Internet. They traverse the hyperlink structure of the World Wide Web in order to retrieve information. There are several reasons to distinguish web robot requests from human user requests, since some activities of web robots can be harmful to the web. Firstly, web robots are employed to assemble business intelligence at e-commerce sites; in such situations, the e-commerce site may need to detect robots. Secondly, many e-commerce sites carry out web traffic analysis to infer how their customers access the site; unfortunately, such analysis can be distorted by the presence of web robots. Thirdly, web robots often consume considerable network bandwidth and server resources at the expense of other users. A web log file is a file automatically created and maintained by a web server to record the activity it performs; it maintains a history of the page requests made to the site. In this paper we use four methods together to detect requests and finally confirm them as robot requests. Experiments have been performed on the log file generated by the server of an operational web site, vtulife.com, containing data from March 2013. In our research, the results of web robot detection using various techniques have been compared, and an integrated approach is proposed for confirming robot requests.
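The abstract does not enumerate the four detection methods, so the following is only a minimal illustrative sketch, in Python, of two heuristics commonly used in log-based robot detection: a user-agent keyword match and a check for requests to robots.txt. The log format, field names, regular expression, and keyword list are assumptions for illustration, not the paper's actual implementation.

```python
import re

# Regex for a Combined Log Format line (an assumed layout, not necessarily
# the format of the vtulife.com log described in the paper).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Illustrative keyword list; real detectors typically use a maintained
# database of known robot user-agent strings.
ROBOT_AGENT_KEYWORDS = ("bot", "crawler", "spider", "slurp")

def looks_like_robot(log_line: str) -> bool:
    """Flag a request as a likely robot using two simple heuristics:
    a robot-like user-agent string, or a request for /robots.txt."""
    match = LOG_PATTERN.match(log_line)
    if not match:
        return False  # unparsable line: leave it unflagged
    agent = match.group("agent").lower()
    if any(keyword in agent for keyword in ROBOT_AGENT_KEYWORDS):
        return True
    return match.group("path") == "/robots.txt"

# Example usage on a fabricated log entry.
sample = ('66.249.66.1 - - [15/Mar/2013:10:12:01 +0000] '
          '"GET /robots.txt HTTP/1.1" 200 310 "-" '
          '"Mozilla/5.0 (compatible; Googlebot/2.1)"')
print(looks_like_robot(sample))  # True
```

In practice such syntactic checks are combined with behavioural evidence (request rate, navigation patterns, image-to-page request ratio) before a request is finally confirmed as a robot request, which is the kind of integrated confirmation the abstract describes.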
