Detecting Web Spam Based on Novel Features from Web Page Source Code

Jiayong Liu, Cheng Huang, Yu Su, Shun Lv

doi:10.1155/2020/6662166

Open Access

https://doi.org/10.1155/2020/6662166

Journal: Security and Communication Networks
Publication Date: Dec 17, 2020
Citations: 7
License type: CC BY 4.0

Affiliation: Sichuan University

Abstract

Search engines are critical in people's daily lives because they determine the quality of the information people obtain through searching. Fierce competition for ranking in search engines benefits neither users nor the search engines themselves. Existing research mainly studies the content and links of websites. However, none of these techniques focuses on semantic analysis of links and anchor text for detection. In this paper, we propose a web spam detection method that extracts novel feature sets from the homepage source code and uses a random forest (RF) as the classifier. The novel feature sets are extracted from the homepage's links, hypertext markup language (HTML) structure, and the semantic similarity of its content. We conduct experiments on the WEBSPAM-UK2007 and UK-2011 datasets using five-fold cross-validation, and we design three sets of experiments to evaluate the performance of the proposed method. Evaluated on several indicators, the proposed method with the novel feature sets outperforms other methods, achieving a precision of 0.929 and a recall of 0.930. Experimental results show that the proposed model can effectively detect web spam.
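The evaluation setup named in the abstract (five-fold cross-validation, scored by precision and recall) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `precision_recall` and `five_fold_indices`, the contiguous fold layout, and the choice of label `1` for spam are assumptions made for illustration, and the abstract does not say how the reported scores were averaged across classes or folds.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the class `positive`.

    precision = TP / (TP + FP), recall = TP / (TP + FN).
    Assumption: label 1 marks spam; the source does not specify the encoding.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def five_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k contiguous folds.

    Hypothetical helper: each sample lands in exactly one test fold.
    Real experiments usually shuffle or stratify before splitting.
    """
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


# Toy labels, not data from the paper.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall = precision_recall(y_true, y_pred)
folds = list(five_fold_indices(10))
```

In each fold a classifier would be trained on the `train` indices and scored on the `test` indices, after which the per-fold precision and recall are combined, for example by averaging.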

Highlights

With the rapid development of the network, web applications have become more and more popular in recent years, and search engines are among the most common web tools people use to obtain information every day [1].
Spammers carefully design pages to improve their rankings, as most users only access the first page of search results. There has been a brief definition of web spamming in the literature [5]; in short, web spamming is a black-hat search engine optimization (SEO) technique that deceives search engines to increase the ranking of a page in search engine results. These web pages are called web spam.
The trees of the random forest (RF) algorithm are independent during the training process. The final result is obtained by a vote of all trees.
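The voting step described in the last highlight can be sketched in a few lines of Python. This is an illustration only, assuming each trained tree is a callable that maps a feature dictionary to a label; the three stand-in "trees" and the feature names `link_count`, `hidden_ratio`, and `anchor_similarity` are hypothetical and are not taken from the paper's feature sets.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree independently, then combine the answers by voting."""
    return majority_vote([tree(sample) for tree in trees])


# Hypothetical stand-ins for trained trees; thresholds are made up.
trees = [
    lambda s: "spam" if s["link_count"] > 100 else "normal",
    lambda s: "spam" if s["hidden_ratio"] > 0.3 else "normal",
    lambda s: "spam" if s["anchor_similarity"] < 0.2 else "normal",
]

sample = {"link_count": 250, "hidden_ratio": 0.1, "anchor_similarity": 0.05}
label = forest_predict(trees, sample)  # two of three trees vote "spam"
```

Because the trees do not depend on one another, they can be trained and queried in parallel. Using an odd number of trees avoids ties in a two-class vote.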

Summary

Introduction

With the rapid development of the network, web applications have become more and more popular in recent years, and search engines are among the most common web tools people use to obtain information every day [1]. There has been a brief definition of web spamming in the literature [5]; in short, web spamming is a black-hat search engine optimization (SEO) technique that deceives search engines to increase the ranking of a page in search engine results. Spammers try to deceive search engines and attract end users to click on web spam sites. They reduce the effectiveness and efficiency of search engine results, since web spam pages take a long time to process and may be full of malicious content and links. Although search engine companies have used various methods to counter spam [7], it remains a challenge to curb the spread of black-hat SEO techniques and the growth of spam pages.
