Abstract
Web crawlers have been misused for several malicious purposes, such as downloading server data without permission from the website administrator. Moreover, armoured crawlers keep evolving to defeat new anti-crawler mechanisms in the arms race between crawler developers and crawler defenders. In this paper, based on the observation that normal users and malicious crawlers exhibit different short-term and long-term download behaviours, we develop a new anti-crawler mechanism called PathMarker to detect and constrain persistent distributed crawlers. By adding a marker to each Uniform Resource Locator (URL), we can trace both the page that leads to the access of this URL and the identity of the user who accesses it. With this supporting information, we can not only perform more accurate heuristic detection using path-related features, but also develop a Support Vector Machine (SVM) based machine learning detection model that distinguishes malicious crawlers from normal users by inspecting their different patterns of URL visiting paths and URL visiting timings. In addition to effectively detecting crawlers at the earliest stage, PathMarker can dramatically suppress the scraping efficiency of crawlers before they are detected. We deploy our approach on an online forum website, and the evaluation results show that PathMarker can quickly capture all six open-source and in-house crawlers, plus two external crawlers (i.e., Googlebots and Yahoo Slurp).
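The SVM-based detector described above operates on path-related and timing-related features of each user session. As a minimal illustration of what such features might look like (the specific feature set below is our own assumption, not the paper's), the following sketch computes URL path depth statistics and mean inter-request interval from a session log; an exhaustive breadth-first crawler would tend to show low depth variance and near-constant intervals, while a human following content links shows more irregular values:

```python
import statistics
from urllib.parse import urlparse

def session_features(visits):
    """Compute simple path/timing features from one user session.

    visits: list of (timestamp_seconds, url) tuples in visit order.
    Returns (mean_path_depth, path_depth_variance, mean_interval).
    Feature names are illustrative; the paper does not publish its
    exact feature list here.
    """
    depths = [len([seg for seg in urlparse(url).path.split("/") if seg])
              for _, url in visits]
    times = [t for t, _ in visits]
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    return (
        statistics.mean(depths),
        statistics.pvariance(depths),
        statistics.mean(intervals) if intervals else 0.0,
    )

# Example session: a user drilling down one thread of a forum.
feats = session_features([
    (0, "http://forum.example/a"),
    (2, "http://forum.example/a/b"),
    (5, "http://forum.example/a/b/c"),
])
```

Feature vectors like these, labelled from known-human and known-crawler sessions, would then be fed to an off-the-shelf SVM trainer.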
Highlights
With the prosperity of Internet data sources, the demand for crawlers is increasing dramatically.
We develop a new anti-crawler mechanism called PathMarker that aims to detect and constrain persistent distributed inside crawlers, which hold valid user accounts and stealthily scrape valuable website content.
Since normal users cannot know the plaintext of a marked Uniform Resource Locator (URL), it is difficult for them to remember the URLs or infer the content of the corresponding web pages.
Summary
With the prosperity of Internet data sources, the demand for crawlers is increasing dramatically. Machine learning detection mechanisms can detect malicious crawlers based on the different visiting patterns of normal users and malicious crawlers (Stevanovic et al., 2013; Stassopoulou and Dikaiakos, 2006, 2009). In other words, they first model the normal website access behaviour and define any other behaviour as abnormal. We develop a new anti-crawler mechanism called PathMarker that aims to detect and constrain persistent distributed inside crawlers, which hold valid user accounts and stealthily scrape valuable website content. When a number of distributed crawlers collude on a download task, each individual crawler may exhibit no obvious path pattern. We solve this problem in PathMarker by automatically generating and appending a marker to each web page URL.
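The marker appended to each URL must let the server recover which user received the link and on which page, while remaining opaque to the user. The paper encrypts this information inside the marker itself; the sketch below is a simplified alternative that uses a random token with a server-side lookup table (all names, such as the `m=` query parameter, are our assumptions for illustration):

```python
import secrets

class PathMarkerSketch:
    """Simplified sketch of PathMarker-style URL marking.

    Each link served to a user carries an opaque token; the server
    keeps a token -> (user_id, parent_url) table so that any later
    request can be traced back to the user identity and the page
    that exposed the link. (The actual system encrypts this data
    into the marker instead of storing it server-side.)
    """

    def __init__(self):
        self._table = {}  # token -> (user_id, parent_url)

    def mark(self, url, user_id, parent_url):
        """Append an opaque marker to a URL before serving it."""
        token = secrets.token_urlsafe(8)
        self._table[token] = (user_id, parent_url)
        sep = "&" if "?" in url else "?"
        return f"{url}{sep}m={token}"

    def trace(self, marked_url):
        """Recover (user_id, parent_url) from a marked URL, or None."""
        token = marked_url.rsplit("m=", 1)[-1]
        return self._table.get(token)

# A link on the index page, served to user "alice":
pm = PathMarkerSketch()
marked = pm.mark("http://forum.example/thread/42",
                 "alice", "http://forum.example/index")
```

Because every user receives differently marked copies of the same link, colluding distributed crawlers that share scraped URLs reveal themselves: requests arrive carrying markers issued to other accounts.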