Abstract

The main idea behind creating SpiderTrap was to build a website that can track how Internet bots crawl it. To track bots, the honeypot dynamically generates different types of hyperlinks on its web pages, leading from one article to another, and logs the information passed by web clients in HTTP requests when these links are visited. By analyzing the sequences of visited links and the accompanying HTTP requests, it is possible to detect bots, reveal their crawling or scanning algorithms, and identify other characteristic features of the traffic they generate. In our research we focused on identifying and describing whole bot operations rather than classifying single HTTP requests. This novel approach has given us insight into what different types of Internet bots are looking for and how they work. This information can be used to optimize websites for search engines' bots to gain a better position on search results pages, or to prepare a set of rules for tools that filter traffic to web pages, minimizing the impact of bad and unwanted bots on websites' availability and security. We present the results of five months of SpiderTrap's activity, during which the honeypot was accessible under two domains (.pl and .eu) as well as by an IP address. The results show examples of activity of well-known Internet bots, such as Googlebot or Bingbot, of unknown crawlers, and of scanners trying to exploit vulnerabilities in the most popular web frameworks or looking for active webshells (i.e., access points left by other attackers to control a web server).
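The crawling-trap mechanism described above can be illustrated with a short, hypothetical sketch. This is not the authors' SpiderTrap code; it only shows, under assumed endpoint names and an assumed log format, how a page can embed dynamically generated links and log every HTTP request so that visit sequences can be analyzed later (Flask is used here only for brevity).

# A minimal, hypothetical sketch of the idea described in the abstract,
# not the authors' actual SpiderTrap implementation. Endpoint names,
# link-generation strategy, and the log format are illustrative assumptions.
import json
import random
import time
import uuid

from flask import Flask, request

app = Flask(__name__)

def render_article(article_id):
    # Outgoing links are generated on the fly, so a crawler that follows
    # them reveals its traversal strategy (breadth/depth, revisits, etc.).
    items = "".join(
        f'<li><a href="/article/{uuid.uuid4().hex}">related article {i}</a></li>'
        for i in range(random.randint(3, 8))
    )
    return f"<html><body><h1>Article {article_id}</h1><ul>{items}</ul></body></html>"

@app.route("/")
@app.route("/article/<article_id>")
def article(article_id="index"):
    # Log the fields the abstract says are analyzed: requested path, client IP,
    # and HTTP headers such as User-Agent, for later sequence analysis.
    record = {
        "timestamp": time.time(),
        "path": request.path,
        "client_ip": request.remote_addr,
        "headers": dict(request.headers),
    }
    with open("visits.jsonl", "a") as log:
        log.write(json.dumps(record) + "\n")
    return render_article(article_id)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)

Analyzing visits.jsonl offline, for example by grouping requests per client IP and ordering them by timestamp, then recovers the link-following sequences the paper studies.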

Highlights

  • According to Cisco's white paper [1], all IP traffic in 2017 amounted to about 1.5 zettabytes (1.5 trillion gigabytes); 17% of that traffic was related to web pages and raw data (excluding video-related traffic and file sharing, which accounted for 75% and 7% respectively)

  • Using a simple list of rules, we were able to classify requests as offensive or inoffensive. Merging these two sorts of data makes it possible to create rules for web servers or web application firewalls that filter out traffic from unwanted bots (an illustrative sketch of such rule matching follows these highlights)
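As a rough illustration of the rule-based classification mentioned in the last highlight: the patterns below are assumptions inspired by the scanner behaviour described in the abstract (CMS login probing, webshell lookups, path traversal), not the paper's actual rule list.

# Hypothetical offensive/inoffensive request classification. The pattern
# list is an illustrative assumption, not the rule set used in the paper.
import re

OFFENSIVE_PATTERNS = [
    re.compile(r"/wp-login\.php", re.I),             # CMS login brute-force probing
    re.compile(r"/phpmyadmin", re.I),                # admin-panel discovery
    re.compile(r"\.(php|jsp|asp)\?.*\bcmd=", re.I),  # webshell-style command parameter
    re.compile(r"\.\./"),                            # path traversal attempts
]

def classify(path_and_query):
    """Return 'offensive' if the request matches any known bad pattern."""
    for pattern in OFFENSIVE_PATTERNS:
        if pattern.search(path_and_query):
            return "offensive"
    return "inoffensive"

if __name__ == "__main__":
    for sample in ("/article/42", "/wp-login.php", "/shell.php?cmd=id", "/../../etc/passwd"):
        print(sample, "->", classify(sample))

Rules of this shape translate directly into web-server or web application firewall filter entries, which is the use the highlight points to.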

Introduction

According to Cisco's white paper [1], all IP traffic in 2017 amounted to about 1.5 zettabytes (that is, 1.5 trillion gigabytes). 17% of that traffic was related to web pages and raw data (excluding video-related traffic and file sharing, which accounted for 75% and 7% respectively). Cisco estimated annual Internet traffic in 2018 at around 1.8 zettabytes. According to Distil Networks' "2019 Bad Bot Report" [2], in 2018 62.1% of all Internet requests they logged were generated by human users, 17.5% by good bots, and 20.4% by bad bots. By a bot we understand a computer program or a script that automatically visits web pages; bad bots intend to steal information from websites or attack them, while good bots intend to classify or catalog websites' content, following the rules set by the websites' owners. Bad bots pose different types of threats to websites. These threats range from scraping content, through different …
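To make the good-bot definition above concrete: well-behaved crawlers typically honour the rules site owners publish in robots.txt. The snippet below is a generic illustration using Python's standard urllib.robotparser; the rules and URLs are assumed examples, not data from the paper.

# Checking crawl permissions the way a well-behaved bot would.
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# A good bot skips disallowed paths; a bad bot simply ignores the file.
print(parser.can_fetch("ExampleBot", "https://example.com/private/report"))  # False
print(parser.can_fetch("ExampleBot", "https://example.com/articles/1"))      # True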
