Abstract
This paper proposes an advanced countermeasure against distributed web-crawlers. We investigated existing crawler-detection methods and analyzed how distributed crawlers can bypass them. Our method detects distributed crawlers by exploiting the property that web traffic follows a power-law distribution: when web pages are sorted by the number of requests, most requests are concentrated on a small set of frequently requested pages. In addition, there are web pages that normal users rarely request, but crawlers will request them because their algorithms iteratively parse pages to collect every item they encounter. Therefore, if an IP address is frequently used to request web pages located in the long-tail area of the power-law distribution, that address can be classified as a crawler node. Experimental results on NASA web traffic data showed that our method identified distributed crawlers with 0.0275% false positives, whereas a conventional frequency-based detection method produced 2.882% false positives at an equal access threshold.
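A minimal sketch of this long-tail heuristic is shown below, assuming a request log given as a list of (IP, URL) pairs; the head/tail split ratio, the per-IP hit threshold, and the function name are illustrative assumptions, not values or code from the paper.

    from collections import Counter

    def flag_longtail_ips(requests, head_traffic_share=0.8, min_tail_hits=10):
        # requests: a list of (ip, url) pairs parsed from an access log.
        # head_traffic_share and min_tail_hits are illustrative parameters.
        page_counts = Counter(url for _, url in requests)
        total = sum(page_counts.values())
        if total == 0:
            return set()
        # Rank pages by popularity; the "head" is the smallest set of top
        # pages covering head_traffic_share of all requests, and every
        # remaining page belongs to the long tail (power-law assumption).
        head, covered = set(), 0
        for url, count in page_counts.most_common():
            if covered / total >= head_traffic_share:
                break
            head.add(url)
            covered += count
        # Count per-IP hits on long-tail pages and flag heavy requesters.
        tail_hits = Counter(ip for ip, url in requests if url not in head)
        return {ip for ip, hits in tail_hits.items() if hits >= min_tail_hits}

In practice the head/tail boundary would be fitted to the observed traffic distribution rather than fixed in advance, but the flow is the same: rank pages by request count, isolate the long tail, and flag IP addresses that hit it disproportionately often.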
Highlights
Web crawling is used in various fields to collect data [1, 2]
Some companies prohibit web-crawlers from accessing their web pages for several reasons: first, web-crawlers may degrade the availability of web servers
If the number of requests from a client exceeds a certain threshold within a predefined duration, the web server classifies the client as a crawler [4], as sketched below
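A minimal sketch of such a frequency-based check follows, assuming a sliding time window per client IP; the window length and request cap are placeholder values, not thresholds from the paper.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # placeholder window length
    MAX_REQUESTS = 100    # placeholder request cap per window

    recent = defaultdict(deque)  # client IP -> timestamps of recent requests

    def is_crawler(ip, now=None):
        # Classify a client as a crawler if its request count within the
        # sliding window exceeds the cap. A distributed crawler can evade
        # this check by spreading requests across many source IPs so that
        # each individual IP stays under the cap.
        now = time.time() if now is None else now
        q = recent[ip]
        q.append(now)
        while q and now - q[0] > WINDOW_SECONDS:
            q.popleft()
        return len(q) > MAX_REQUESTS

This per-IP counting is precisely the check that IP-distributed crawlers bypass, which motivates the long-tail approach described in the abstract.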
Summary
Web crawling is used in various fields to collect data [1, 2]. Some web services try to detect crawling activity and prevent crawlers from accessing web pages through anti-crawler methods, but some malicious web-crawlers bypass these methods by modifying their header values or by distributing requests across many source IP addresses to masquerade as normal users. The experimental results showed that our method can effectively identify distributed crawlers with 0.0275% false positives. In the conventional frequency-based method, when the threshold is lowered to detect more crawler nodes, false positives increase. This paper is organized as follows: Section 2 describes conventional anti-crawling methods and how distributed crawlers can bypass them.