Abstract
Automated programs (bots) are responsible for a large percentage of website traffic. These bots can be used either for benign purposes, such as Web indexing, Website monitoring (validation of hyperlinks and HTML code), feed fetching, and data extraction for commercial use, or for malicious ones, including, but not limited to, content scraping, vulnerability scanning, account takeover, distributed denial of service attacks, marketing fraud, carding and spam. To protect themselves, Web servers try to identify bot sessions and apply special rules to them, such as throttling their requests or delivering different content. The methods currently used for the identification of bots are based either purely on rule-based bot detection techniques or on a combination of rule-based and machine learning techniques. While current research has developed highly adequate methods for Web bot detection, the adequacy of these methods against Web bots that actively try to remain undetected has not been studied. For this reason, we created a Web bot detection framework and evaluated its ability to detect conspicuous bots separately from its ability to detect advanced Web bots. We assessed the proposed framework's performance using real HTTP traffic from a public Web server. Our experimental results show that the proposed framework is highly effective at detecting Web bots that do not try to hide their bot identity using HTTP Web logs (balanced accuracy above 95% in a false-positive-intolerant server). However, detecting advanced Web bots that present a browser fingerprint, and possibly humanlike behaviour as well, is considerably more difficult.
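As an illustration of how such a framework might score sessions from HTTP Web logs, the following is a minimal sketch in Python. It assumes an Apache-style combined log format, sessions keyed by client IP and user agent, and three toy features (request count, error ratio, mean inter-request gap); the paper's actual feature set and model are not reproduced here.

```python
"""Minimal sketch: session features from HTTP Web logs plus a supervised classifier.

Assumptions (not from the paper): Apache combined log format, sessions keyed by
(client IP, user agent), and three illustrative per-session features.
"""
import re
from collections import defaultdict
from datetime import datetime
from statistics import mean

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Apache "combined" log format: ip - - [timestamp] "request" status size "referer" "agent"
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def session_features(log_lines):
    """Group requests per (ip, user agent) and compute toy per-session features."""
    sessions = defaultdict(list)
    for line in log_lines:
        m = LOG_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        sessions[(m["ip"], m["agent"])].append((ts, int(m["status"])))

    features = {}
    for key, requests in sessions.items():
        requests.sort()
        gaps = [(b[0] - a[0]).total_seconds() for a, b in zip(requests, requests[1:])]
        features[key] = [
            len(requests),                                            # request count
            sum(1 for _, s in requests if s >= 400) / len(requests),  # error ratio
            mean(gaps) if gaps else 0.0,                              # mean inter-request gap
        ]
    return features


def train_and_evaluate(X, y):
    """Fit a classifier on labelled sessions and report balanced accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0
    )
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te))
```

Given sessions labelled as human or bot, `train_and_evaluate` returns the balanced accuracy, the same metric quoted in the evaluation above; the model and features here are placeholders.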
Highlights
The vast amount of content hosted on the Internet has rendered the use of Web bots necessary
The most widely used techniques for Web bot detection are based on the CAPTCHA (i.e. Completely Automated Public Turing test to tell Computers and Humans Apart) [28], such as the reCAPTCHA offered by Google
The purpose of this paper is to identify the unique challenges that arise when state-of-the-art Web bot detection techniques are utilised for detecting advanced Web bots as opposed to simple bots
Summary
The vast amount of content hosted on the Internet has rendered the use of Web bots necessary. Popular uses of Web bots include Web indexing, Website monitoring (validation of hyperlinks and HTML code), data extraction for commercial purposes and feed fetching. To perform these actions, bots visit Web servers repeatedly and, in some cases, for a prolonged period of time [10]. Allowing bots unrestricted access to Web server content and services is not a good practice. The most widely used techniques for Web bot detection are based on the CAPTCHA (i.e. Completely Automated Public Turing test to tell Computers and Humans Apart) [28], such as the reCAPTCHA offered by Google. The test relies on the assumption that a human can extract letters from a distorted image or an audio file, or select an object in an image, while a Web bot cannot.
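To make the challenge–response assumption concrete, the sketch below outlines a generic text-CAPTCHA flow, not Google's reCAPTCHA API: the server issues a random code that would be rendered as a distorted image or audio clip, keeps only an expiring hash of it, and accepts the answer only if it matches before the deadline. All names and the timeout value are illustrative.

```python
"""Minimal sketch of a generic text-CAPTCHA challenge/response flow.

Illustrative only: it shows the server-side bookkeeping (issue a random code,
store an expiring hash, verify the answer), not the image/audio distortion step
or any vendor API.
"""
import hashlib
import hmac
import secrets
import string
import time

CHALLENGE_TTL = 120  # seconds a challenge stays valid (illustrative value)
_pending = {}        # challenge_id -> (answer_hash, expiry_timestamp)


def _digest(text: str) -> str:
    """Normalise and hash an answer so the plaintext code is not stored."""
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()


def issue_challenge(length: int = 6) -> tuple[str, str]:
    """Create a challenge; the code would be rendered as a distorted image or audio."""
    code = "".join(
        secrets.choice(string.ascii_uppercase + string.digits) for _ in range(length)
    )
    challenge_id = secrets.token_urlsafe(16)
    _pending[challenge_id] = (_digest(code), time.time() + CHALLENGE_TTL)
    return challenge_id, code  # in practice only the rendered form leaves the server


def verify_response(challenge_id: str, answer: str) -> bool:
    """Accept only if the answer matches before the challenge expires (one attempt)."""
    record = _pending.pop(challenge_id, None)
    if record is None:
        return False
    answer_hash, expires_at = record
    if time.time() > expires_at:
        return False
    return hmac.compare_digest(answer_hash, _digest(answer))
```

A human who transcribes the distorted code passes `verify_response`, while a simple bot that cannot read the rendering fails; this is the asymmetry the test relies on, and it is precisely what advanced bots try to undermine.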