Web Spam Detection: New Approach with Hidden Markov Models

Ali Asghar Torabi, Kaveh Taghipour, Shahram Khadivi

doi:10.1007/978-3-642-45068-6_21

Abstract

Web spam results from techniques that deceive search engine algorithms in order to obtain higher ranks in search results. Advanced spammers use keyword stuffing and link stuffing to create farms of spam pages. Most recent work on web spam detection uses graph-based methods to improve accuracy. This paper proposes a probabilistic approach that uses content-based and link-based features to detect web spam pages. Because we observe high connectivity between web spam pages, we adopt a method based on Hidden Markov Models to exploit the conditional dependency among a sequence of hosts and the spam/normal class distribution of each host. Experimental results show that the proposed method can significantly improve the performance of the baseline classifier.
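To illustrate the idea of labeling a sequence of hosts jointly rather than one at a time, the following is a minimal sketch of Viterbi decoding over a two-state (normal/spam) Hidden Markov Model. It is not the authors' implementation. The function name `viterbi`, the transition and prior values, and the use of per-host baseline scores as emission likelihoods are all assumptions made for illustration, and the numbers are invented. All probabilities are assumed to be strictly positive.

```python
import math

STATES = ("normal", "spam")


def viterbi(emissions, transition, prior):
    """Return the most likely state sequence for a chain of hosts.

    emissions:  list of dicts, one per host, mapping state -> likelihood
                (hypothetical: taken here from a baseline classifier's scores)
    transition: dict of dicts, transition[prev][cur] = P(cur | prev)
    prior:      dict mapping state -> P(state) for the first host
    """
    # Work in log space to avoid underflow on long host sequences.
    scores = {s: math.log(prior[s]) + math.log(emissions[0][s]) for s in STATES}
    backpointers = []
    for obs in emissions[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for ending in state s at this host.
            best_prev = max(
                STATES, key=lambda p: scores[p] + math.log(transition[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(transition[best_prev][s])
                + math.log(obs[s])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Invented example: three linked hosts; the middle one looks slightly
# "normal" to the baseline classifier when judged in isolation.
emissions = [
    {"normal": 0.10, "spam": 0.90},
    {"normal": 0.55, "spam": 0.45},
    {"normal": 0.10, "spam": 0.90},
]
prior = {"normal": 0.5, "spam": 0.5}

# Assumed "sticky" transitions: linked hosts tend to share a class.
sticky = {
    "normal": {"normal": 0.9, "spam": 0.1},
    "spam": {"normal": 0.1, "spam": 0.9},
}
# Uniform transitions ignore neighbors, mimicking per-host classification.
uniform = {
    "normal": {"normal": 0.5, "spam": 0.5},
    "spam": {"normal": 0.5, "spam": 0.5},
}

print(viterbi(emissions, sticky, prior))   # ['spam', 'spam', 'spam']
print(viterbi(emissions, uniform, prior))  # ['spam', 'normal', 'spam']
```

With the sticky transitions, the borderline middle host is relabeled as spam because both of its neighbors are confidently spam. With uniform transitions, the result reduces to the per-host decision.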
