Abstract

Web spam detection is a critical issue given today's rapidly growing use of the Internet and the World Wide Web. The upsurge of web spam has significantly degraded the Quality of Service (QoS) of the World Wide Web, and the resulting deterioration of search engine results has motivated research into detecting spam pages efficiently and accurately. Existing user-behaviour-oriented web spam detection models employ content-based, link-based, and other features of webpages to classify web spam. These user-behaviour techniques, whether applied singly or in combination, have achieved good detection performance; however, the effectiveness of these features in identifying web spam correctly still needs to be determined. In this study, predictive web spam detection models that employ all related user-behaviour features of webpages were developed and evaluated. The content-based, link-based, and obvious-based feature datasets were collected from an online repository, and relevant features were extracted using an improved filter-based method. Six user-behaviour-related features extracted from the datasets were used to combine the datasets into every possible subset of the required feature space, so that 7 new datasets were generated for the study. A Multi-Layer Perceptron (MLP) was adopted as the classifier for each of the identified feature sets. A Python machine learning library was used to simulate the models with 60/40%, 70/30%, and 80/20% training/testing splits, and performance was evaluated using accuracy, True Positive (TP) rate, False Positive (FP) rate, and precision as metrics. The results showed that, for the majority of the datasets, the formulated models improved in efficiency after feature selection. The MLP classifier achieved its best result of 66.0% accuracy when the link-based dataset was used with feature selection. The study concluded that the link-based features of a user are sufficient and effective for the detection of web spam.

Keywords: Web spam, Content-based, Link-based, features, user-behaviour, evaluation

DOI: 10.7176/NCS/10-07

Publication date: December 31st, 2019
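As a rough illustration of the pipeline the abstract describes, the sketch below (not the authors' code) combines three hypothetical feature groups into the 7 possible non-empty subsets, applies a filter-based feature selector, and evaluates an MLP under the three train/test splits using scikit-learn. The column names, the CSV path, and the chi-squared scoring function are illustrative assumptions, not the study's actual setup.

    # Minimal sketch of the reported pipeline, assuming scikit-learn and a
    # CSV of labelled webpage features. All names below are placeholders.
    from itertools import combinations

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score, precision_score, confusion_matrix

    # Hypothetical feature groups; the real column lists come from the dataset.
    FEATURE_GROUPS = {
        "content": ["word_count", "avg_word_len", "title_words"],
        "link":    ["in_links", "out_links", "broken_links"],
        "obvious": ["page_rank", "domain_age", "load_time"],
    }

    df = pd.read_csv("webspam_features.csv")   # placeholder path
    y = df["is_spam"]                          # 1 = spam, 0 = non-spam

    for r in (1, 2, 3):
        for groups in combinations(FEATURE_GROUPS, r):   # 7 subsets in total
            cols = [c for g in groups for c in FEATURE_GROUPS[g]]
            X = df[cols]
            # Filter-based selection: rank features against the class label
            # (chi-squared scoring used here as one example of a filter).
            k = min(6, X.shape[1])
            X_sel = SelectKBest(chi2, k=k).fit_transform(X.abs(), y)
            for test_size in (0.4, 0.3, 0.2):   # 60/40, 70/30, 80/20 splits
                X_tr, X_te, y_tr, y_te = train_test_split(
                    X_sel, y, test_size=test_size, random_state=42)
                clf = MLPClassifier(max_iter=500, random_state=42)
                y_pred = clf.fit(X_tr, y_tr).predict(X_te)
                tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
                print(f"{'+'.join(groups):24s} "
                      f"split={1 - test_size:.0%}/{test_size:.0%} "
                      f"acc={accuracy_score(y_te, y_pred):.3f} "
                      f"TPR={tp / (tp + fn):.3f} FPR={fp / (fp + tn):.3f} "
                      f"prec={precision_score(y_te, y_pred, zero_division=0):.3f}")

Each printed row corresponds to one feature-subset/split combination, which mirrors how the study compares the 7 generated datasets across the three percentage splits using accuracy, TP rate, FP rate, and precision.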

Highlights

  • Web spams are unsolicited, unwanted emails, ads, links, or content sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient, or unsolicited commercial mail usually sent to a large group of recipients at the same time by service providers such as Internet Service Providers (ISPs) (Ndumiyana et al., 2013)

  • The data collected contained 3999 records of web pages, assessed as spam or non-spam based on user-behaviour scores alongside the features identified from the 3 classes of datasets collected

  • This approach was based on the Wisdom of the Crowds principle, which focuses on using users' interactions with webpages to determine the nature of the pages visited, based on the characteristics of the six (6) user-behaviour features identified



Introduction

Web spams are unsolicited, unwanted emails, ads, links, or content sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient, or unsolicited commercial mail usually sent to a large group of recipients at the same time by service providers such as Internet Service Providers (ISPs) (Ndumiyana et al., 2013). The nefarious acts posed by web spam include subverting the ranking algorithms of web search engines, causing them to rank search results higher than they otherwise would (Najork, 2009). The spam situation is so disruptive and infuriating that search engines, web users, and email recipients spend a great deal of time trying to combat it, since it leads to the loss of substantial resources, finances, and time. Spam classifiers and filters are being created to combat web spam, but spammers are resourceful and continually develop new ways to manipulate their way into search engines (Castillo et al., 2007). Most of these algorithms are not proactive and cannot withstand the pressure from spam pages: even if spam links, emails, or web pages are blocked by the filters or classifiers the first time, the spam soon replicates itself and hits the algorithm repeatedly until it makes its way through
