Abstract

The main idea behind creating SpiderTrap was to build a website that can track how Internet bots crawl it. To track bots, the honeypot dynamically generates different types of hyperlinks on its web pages, leading from one article to another, and logs the information passed by web clients in HTTP requests when these links are visited. By analyzing the sequences of visited links and the accompanying HTTP requests, it is possible to detect bots, reveal their crawling or scanning algorithms, and identify other characteristic features of the traffic they generate. In our research we focused on identifying and describing whole bot operations rather than classifying single HTTP requests. This novel approach has given us insight into what different types of Internet bots are looking for and how they work. This information can be used to optimize websites for search engines' bots to gain a better position on search results pages, or to prepare a set of rules for tools that filter traffic to web pages, minimizing the impact of bad and unwanted bots on websites' availability and security. We present the results of five months of SpiderTrap's activity, during which the honeypot was accessible under two domains (.pl and .eu) as well as by an IP address. The results show examples of activity of well-known Internet bots, such as Googlebot or Bingbot, of unknown crawlers, and of scanners trying to exploit vulnerabilities in the most popular web frameworks or looking for active webshells (i.e., access points left by other attackers to control a web server).
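The crawling-trap mechanism described above can be illustrated with a short, hypothetical sketch. This is not the authors' SpiderTrap code; it only shows, under assumed endpoint names and an assumed log format, how a page can embed dynamically generated links and log every HTTP request so that visit sequences can be analyzed later (Flask is used here only for brevity).

# A minimal, hypothetical sketch of the idea described in the abstract,
# not the authors' actual SpiderTrap implementation. Endpoint names,
# link-generation strategy, and the log format are illustrative assumptions.
import json
import random
import time
import uuid

from flask import Flask, request

app = Flask(__name__)

def render_article(article_id):
    # Outgoing links are generated on the fly, so a crawler that follows
    # them reveals its traversal strategy (breadth/depth, revisits, etc.).
    items = "".join(
        f'<li><a href="/article/{uuid.uuid4().hex}">related article {i}</a></li>'
        for i in range(random.randint(3, 8))
    )
    return f"<html><body><h1>Article {article_id}</h1><ul>{items}</ul></body></html>"

@app.route("/")
@app.route("/article/<article_id>")
def article(article_id="index"):
    # Log the fields the abstract says are analyzed: requested path, client IP,
    # and HTTP headers such as User-Agent, for later sequence analysis.
    record = {
        "timestamp": time.time(),
        "path": request.path,
        "client_ip": request.remote_addr,
        "headers": dict(request.headers),
    }
    with open("visits.jsonl", "a") as log:
        log.write(json.dumps(record) + "\n")
    return render_article(article_id)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Analyzing visits.jsonl offline, for example by grouping requests per client IP and ordering them by timestamp, then recovers the link-following sequences the paper studies.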

Highlights

  • According to Cisco's white paper [1], all IP traffic in 2017 amounted to about 1.5 zettabytes (1.5 trillion gigabytes); 17% of that traffic was related to web pages and raw data (excluding video-related traffic and file sharing, which accounted for 75% and 7% respectively)

  • Using a simple list of rules, we were able to classify requests as offensive or inoffensive. Merging these two sorts of data makes it possible to create rules for web servers or web application firewalls that filter out traffic from unwanted bots (an illustrative sketch of such rule matching follows these highlights)
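As a rough illustration of the rule-based classification mentioned in the last highlight: the patterns below are assumptions inspired by the scanner behaviour described in the abstract (CMS login probing, webshell lookups, path traversal), not the paper's actual rule list.

# Hypothetical offensive/inoffensive request classification. The pattern
# list is an illustrative assumption, not the rule set used in the paper.
import re

OFFENSIVE_PATTERNS = [
    re.compile(r"/wp-login\.php", re.I),             # CMS login brute-force probing
    re.compile(r"/phpmyadmin", re.I),                # admin-panel discovery
    re.compile(r"\.(php|jsp|asp)\?.*\bcmd=", re.I),  # webshell-style command parameter
    re.compile(r"\.\./"),                            # path traversal attempts
]

def classify(path_and_query):
    """Return 'offensive' if the request matches any known bad pattern."""
    for pattern in OFFENSIVE_PATTERNS:
        if pattern.search(path_and_query):
            return "offensive"
    return "inoffensive"

if __name__ == "__main__":
    for sample in ("/article/42", "/wp-login.php", "/shell.php?cmd=id", "/../../etc/passwd"):
        print(sample, "->", classify(sample))

Rules of this shape translate directly into web-server or web application firewall filter entries, which is the use the highlight points to.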

Introduction

According to Cisco's white paper [1], all IP traffic in 2017 amounted to about 1.5 zettabytes (that is, 1.5 trillion gigabytes). 17% of that traffic was related to web pages and raw data (excluding video-related traffic and file sharing, which accounted for 75% and 7% respectively). Cisco estimated annual Internet traffic in 2018 at around 1.8 zettabytes. According to Distil Networks' "2019 Bad Bot Report" [2], in 2018 62.1% of all Internet requests they logged were generated by human users, 17.5% by good bots, and 20.4% by bad bots. By a bot we understand a computer program or a script that automatically visits web pages; bad bots intend to steal information from websites or attack them, while good bots intend to classify or catalog websites' content, following the rules set by the websites' owners. Bad bots pose different types of threats to websites. These threats range from scraping content, through different …
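To make the good-bot definition above concrete: well-behaved crawlers typically honour the rules site owners publish in robots.txt. The snippet below is a generic illustration using Python's standard urllib.robotparser; the rules and URLs are assumed examples, not data from the paper.

# Checking crawl permissions the way a well-behaved bot would.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A good bot skips disallowed paths; a bad bot simply ignores the file.
print(parser.can_fetch("ExampleBot", "https://example.com/private/report"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/articles/1"))      # True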
