Abstract

Many term-weighting models have been proposed for information retrieval. In this paper we investigate the extent to which the retrieval effectiveness of term-weighting models is affected by the presence of spam Web pages. We perform retrieval experiments on the ClueWeb09-English (Category A) dataset, a substantial fraction of which consists of spam pages deliberately designed to manipulate commercial search engines, as well as on the ClueWeb12 (Category A) dataset. We run the ad hoc tasks of the TREC Web tracks from 2009 through 2012 to examine the spam sensitivity of state-of-the-art retrieval models, using Apache Lucene as the retrieval engine. In addition, the ad hoc tasks of two Web tracks and two Tasks tracks from 2013 through 2016 are included in one part of the experiment, in which we inspect the number of documents explicitly judged as spam in the search results returned by each retrieval model. Our experimental results show that hypergeometric models of information retrieval are more resistant to spam content than other models. All the results presented in this article are fully repeatable and reproducible, with data and code available online in a public GitHub repository.
