Abstract

Web robots, or web crawlers, have become a major source of web traffic. While some robots, such as search engine crawlers, are well behaved, others can mount DDoS attacks that pose serious threats to websites. Effective web robot detection benefits not only network traffic cleaning but also the cybersecurity of IoT-enabled systems and services. To capture the state of the art in web robot detection, this paper reviews the past decade of research on web robot/crawler detection techniques, compares their performance, and identifies the challenges each technique faces, thereby providing researchers with a reference for developing web robot detection in real applications. To protect web content from malicious robots, researchers have investigated various approaches, which can be classified into three themes: offline web log analysis, honeypots, and online robot detection. We conclude that offline web log analysis methods achieve high accuracy but are time-consuming compared with online detection methods. Honeypots, as a computer security mechanism, can engage and deceive hackers and identify malicious activities performed over the Internet, but they may block legitimate robots. The review shows that hybrid methods outperform individual classifiers and that the performance of online web robot detection still needs improvement. Moreover, different types of features can play different roles in different machine learning models, so feature selection is important for web robot/crawler detection.
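To make the offline web log analysis theme concrete, the sketch below shows how per-session features can be extracted from access-log records and fed to a classifier. The feature set (robots.txt access, HEAD-request ratio, empty-referrer ratio), the example log records, and the rule-based classifier are illustrative assumptions for this sketch, not the method of any specific paper surveyed; real systems would train a machine learning model on such features.

from collections import defaultdict

# Hypothetical pre-parsed access-log records: (session_id, method, path, referrer).
LOG = [
    ("s1", "GET",  "/robots.txt", "-"),
    ("s1", "HEAD", "/index.html", "-"),
    ("s1", "GET",  "/page2.html", "-"),
    ("s2", "GET",  "/index.html", "https://example.com/"),
    ("s2", "GET",  "/page2.html", "https://example.com/index.html"),
]

def session_features(records):
    """Aggregate per-session features of the kind used in offline log analysis."""
    sessions = defaultdict(list)
    for sid, method, path, referrer in records:
        sessions[sid].append((method, path, referrer))
    feats = {}
    for sid, reqs in sessions.items():
        n = len(reqs)
        feats[sid] = {
            # Crawlers often fetch robots.txt; browsers rarely do.
            "robots_txt": any(p == "/robots.txt" for _, p, _ in reqs),
            # High share of HEAD requests is typical of link checkers/crawlers.
            "head_ratio": sum(m == "HEAD" for m, _, _ in reqs) / n,
            # Robots frequently send no referrer ("-" in Common Log Format).
            "empty_referrer_ratio": sum(r == "-" for _, _, r in reqs) / n,
        }
    return feats

def classify(f):
    """Toy threshold rule standing in for a trained model; thresholds are made up."""
    score = (f["robots_txt"]
             + (f["head_ratio"] > 0.2)
             + (f["empty_referrer_ratio"] > 0.8))
    return "robot" if score >= 2 else "human"

for sid, f in session_features(LOG).items():
    print(sid, f, "->", classify(f))

Running this labels session s1 (which fetched robots.txt with HEAD requests and no referrers) as a robot and s2 as human, illustrating why the choice of features matters as much as the choice of model.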
