Abstract
Identifying and detecting web spam is an ongoing battle between spam researchers and spammers, one that has persisted from the time search engines first enabled searching of web pages to the modern sharing of web links via social networks. A common challenge faced by spam researchers is that new techniques require a corpus of both legitimate and spam web pages. Although large corpora of legitimate web pages are available to researchers, the same cannot be said of spam web pages. In this paper, we introduce the Webb Spam Corpus 2011, a corpus of approximately 330,000 spam web pages, which we make available to researchers in the fight against spam. With a standard corpus available, researchers can better collaborate on developing spam filtering techniques and reporting their results. The corpus contains web pages crawled from links found in over 6.3 million spam emails. We analyze multiple aspects of this corpus, including redirection, HTTP headers, web page content, and classification evaluation. We also provide insights into how web spam has changed since the original Webb Spam Corpus was released in 2006: (1) spammers manipulate social media to spread spam; (2) HTTP headers and page content change over time; and (3) spammers have evolved and adopted new techniques to evade detection based on HTTP header information.
International Journal of Cooperative Information Systems