Abstract

The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results degrade the performance of search engines and frustrate users, who are forced to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since it accurately classifies 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
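To make the content-analysis idea concrete, the following is a minimal sketch, not the authors' implementation, of how precomputed word-correlation factors might be used to score how well a page's title matches its body. The `word_correlation` lookup table, the tokenized inputs, and the 0.5 decision threshold are all assumptions introduced purely for illustration; the paper itself defines the actual factors and thresholds.

```python
def title_body_similarity(title_words, body_words, word_correlation):
    """Average best word-correlation factor between each title word and the body.

    word_correlation is a precomputed dict mapping (word_a, word_b) pairs to a
    similarity factor in [0, 1]; pairs absent from the table default to 0.
    (Hypothetical structure, for illustration only.)
    """
    if not title_words or not body_words:
        return 0.0
    total = 0.0
    for t in title_words:
        # Take the strongest correlation between this title word and any body word.
        total += max(word_correlation.get((t, b), 0.0) for b in body_words)
    return total / len(title_words)


def looks_like_spam(title_words, body_words, word_correlation, threshold=0.5):
    """Flag pages whose title and body are poorly correlated (hypothetical threshold)."""
    return title_body_similarity(title_words, body_words, word_correlation) < threshold


# Toy usage: a keyword-stuffed title that has little to do with the body text.
correlations = {("cheap", "pharmacy"): 0.9, ("recipe", "flour"): 0.8}
print(looks_like_spam(["cheap", "pills"], ["recipe", "flour", "sugar"], correlations))  # True
```

Because the correlation factors are looked up rather than computed at query time, the per-page cost is essentially a table lookup per word pair, which is consistent with the abstract's claim that the approach is computationally inexpensive.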
