Link farm detection using SVMLight tool

D Saraswathi, R Kavitha, A Vijaya Kathiravan

doi:10.1109/iccci.2012.6158833

Abstract

Search engine spam is a web page, or a portion of a web page, created with the intention of artificially inflating its ranking in search engines. Web spamming refers to actions intended to mislead search engines into giving some pages a higher ranking than they deserve. Anyone who uses a search engine frequently has most likely encountered a high-ranking page that consists of nothing more than a collection of query keywords. These pages detract both from the user experience and from the quality of the search engine. Search engine spam has recently increased dramatically, creating problems for both search engines and web surfers: it degrades search results, occupies more memory and takes more time during indexing, and frustrates users with irrelevant results. Search engines have tried many techniques to filter out these spam pages before they appear on the query results page. This paper presents various ways of creating spam pages, a collection of current methods used to detect spam, and a new approach to building a link spam detection tool based on machine learning. The new approach uses the SVMLight tool to detect link spam, considering only the link structure of the Web, regardless of page contents. Statistical features derived from this link structure are used to build a classifier that is tested over a large collection of Web link spam. Link farms can be identified based on the hub and authority degrees of links. The spam classifier makes use of the WordNet word database and the SVMLight tool to classify web links as either spam or not spam. These features relate not only to quantitative data extracted from the Web pages but also to qualitative properties, mainly of the page links.
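As a rough illustration of the link-structure idea described in the abstract, the sketch below computes in-degree, out-degree, and HITS-style hub and authority scores for a tiny directed link graph, then writes each page as one line in the SVMLight input format (`<target> <index>:<value> ...`). The choice of these four features, the toy graph, the labels, and the function names `link_features` and `svmlight_line` are assumptions made for illustration only; the abstract does not specify the paper's actual feature set or pipeline.

```python
from collections import defaultdict


def _normalize(scores):
    """Scale a score dict to unit Euclidean length (no-op if all scores are zero)."""
    norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
    return {node: v / norm for node, v in scores.items()}


def link_features(edges, iterations=50):
    """Map each node to [in_degree, out_degree, hub_score, authority_score].

    `edges` is a list of (source, target) pairs. The feature set is an
    assumption for this sketch, not the paper's documented feature list.
    """
    nodes = sorted({node for edge in edges for node in edge})
    in_links, out_links = defaultdict(list), defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
        in_links[dst].append(src)

    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages linking in.
        auth = _normalize({n: sum(hub[s] for s in in_links[n]) for n in nodes})
        # Hub: sum of authority scores of the pages linked to.
        hub = _normalize({n: sum(auth[d] for d in out_links[n]) for n in nodes})

    return {
        n: [len(in_links[n]), len(out_links[n]), hub[n], auth[n]]
        for n in nodes
    }


def svmlight_line(label, features):
    """Format one example as an SVMLight line with 1-based ascending feature indices."""
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{v:.6g}" for i, v in enumerate(features, start=1)]
    return " ".join([target] + pairs)


# Toy graph and labels (hypothetical): +1 = spam, -1 = not spam.
edges = [("a", "c"), ("b", "c"), ("c", "d")]
labels = {"a": -1, "b": -1, "c": 1, "d": -1}

features = link_features(edges)
lines = [svmlight_line(labels[n], features[n]) for n in sorted(features)]
for line in lines:
    print(line)
```

The printed lines could be saved to a file and passed to SVMLight's training program; real spam labels would have to come from a manually labeled collection rather than the placeholder values used here.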

Full Text