Detecting spam web pages using multilayer extreme learning machine

Rajendra Kumar Roul

doi:10.1504/ijbdi.2018.10008141

Abstract

Web spamming artificially raises the ranking of unimportant pages in search results. Such spam pages mislead the search engine and hinder the retrieval of high-quality information, so detecting and eliminating them is a pressing need. To this end, this study focuses on two important aspects of machine learning. First, it proposes a new content-based spam detection technique which identifies nine important features that help determine whether a page is spam or non-spam. Each feature has an associated value, which is calculated by parsing the documents and then performing the steps required to compute its score. These nine features, together with the class label (spam or non-spam), form a feature vector used to train classifiers to detect spam pages. Second, it highlights the importance of deep learning, using the multilayer extreme learning machine (ELM), in the field of spam page detection. Experiments on two benchmark datasets (WEBSPAM-UK2002 and WEBSPAM-UK2006) show that multilayer ELM yields more promising results than other established classifiers.
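
To make the classification setup concrete, the following is a minimal sketch of a basic single-hidden-layer extreme learning machine trained on nine-dimensional feature vectors with a binary spam/non-spam label. It is not the multilayer ELM evaluated in the study, which is not specified in the abstract. The data are synthetic stand-ins for the nine content-based features, and all names (`train_elm`, `predict_elm`, `n_hidden`, `reg`) are hypothetical choices made for illustration.

```python
import numpy as np

def train_elm(X, y, n_hidden=50, reg=1e-2, seed=0):
    """Fit a basic ELM: random fixed hidden layer, least-squares output layer."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Hidden weights are drawn once at random and never trained.
    W = rng.normal(size=(n_features, n_hidden)) / np.sqrt(n_features)
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)  # hidden-layer activations
    # Output weights via ridge-regularised least squares:
    # beta = (H^T H + reg * I)^{-1} H^T y
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def predict_elm(model, X):
    """Return 1 (spam) or 0 (non-spam) for each row of X."""
    W, b, beta = model
    scores = np.tanh(X @ W + b) @ beta
    return (scores > 0.5).astype(int)

# Synthetic stand-in data (an assumption): 200 pages, nine feature scores each.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 9))
y = (X[:, 0] + X[:, 1] > 0).astype(float)  # made-up labelling rule

model = train_elm(X, y)
pred = predict_elm(model, X)
accuracy = float((pred == y).mean())  # training accuracy on the toy data
```

Only the output weights are learned, in a single linear solve, which is what makes ELM training fast. A multilayer variant would stack further layers in front of this final least-squares step.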

Full Text