Content-based analysis to detect Arabic web spam

Mohammed Al-Kabi, Izzat Alsmadi, Heider Wahsheh, Emad Al-Shawakfa, Ahmed Al-Hmoud, Abdullah Wahbeh

doi:10.1177/0165551512439173

Abstract

Search engines are important outlets for information query and retrieval. They must cope with the continual growth of information available on the web while giving users convenient access to it. This growth brings a further challenge that is becoming steadily harder to eliminate: spam in web pages. For various reasons, web spammers try to intrude on search results and inject artificially biased results in favour of their own websites or pages. Spam pages are added to the internet daily, which makes it difficult for search engines to keep up with the fast-growing, dynamic nature of the web, especially since spammers tend to add more keywords to their websites to deceive search engines and raise the rank of their pages. In this research, we investigated four classification algorithms (naïve Bayes, decision tree, SVM and K-NN) for detecting Arabic web spam pages based on content. The three groups of datasets used, containing 1%, 15% and 50% spam content, were collected with a crawler customized for this study. Spam pages were classified manually. Tests and comparisons revealed that the decision tree was the best classifier for this purpose.
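As a rough illustration of content-based classification, the sketch below applies K-NN (one of the four algorithms named above) to two simple keyword-repetition features. The features, the toy pages and the function names `content_features`, `knn_predict` and `classify` are assumptions made for this example; the abstract does not state which features or implementation the study used.

```python
from collections import Counter
import math


def content_features(text):
    """Return (unique-word ratio, share of the most frequent word).

    Hypothetical features chosen for illustration: keyword-stuffed
    pages tend to have a low unique-word ratio and a high top-word share.
    """
    words = text.lower().split()
    if not words:
        return (0.0, 0.0)
    counts = Counter(words)
    unique_ratio = len(counts) / len(words)
    top_share = counts.most_common(1)[0][1] / len(words)
    return (unique_ratio, top_share)


def knn_predict(train, query, k=3):
    """Majority vote among the k labelled vectors nearest to `query`."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, hand-written pages; NOT taken from the study's dataset.
labelled_pages = [
    ("cheap tickets cheap tickets cheap tickets buy cheap tickets", "spam"),
    ("free free free download free download free", "spam"),
    ("win money win money win money now win", "spam"),
    ("the university library opens at nine every weekday morning", "non-spam"),
    ("our team published a new report on regional water use", "non-spam"),
    ("today's weather forecast predicts light rain in the north", "non-spam"),
]
train = [(content_features(text), label) for text, label in labelled_pages]


def classify(text, k=3):
    """Classify a page's text using the toy training set above."""
    return knn_predict(train, content_features(text), k)
```

Calling `classify` on a heavily repetitive string places it near the toy spam examples in feature space, while a sentence with no repeated words lands near the non-spam ones. A real system would need far richer features and a manually labelled corpus such as the one the abstract describes.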


Journal: Journal of Information Science
Publication Date: Apr 19, 2012
Citations: 20

Similar Papers

Google Penguin: Evasion in Non-English Languages and a New Classifier
Abdulrahman Alarifi ... Mansour Alsaleh
01 Dec 2013

Analysis of Web Spam for Non-English Content: Toward More Effective Language-Based Classifiers.
Mansour Alsaleh ... Abdulrahman Alarifi
PloS one | VOL. 11
17 Nov 2016

A new enhanced technique for link farm detection
D Saraswathi ... R Kavitha
01 Mar 2012
