Abstract

The evaluation of Information Retrieval (IR) systems has recently been exploring the use of preference judgments over two lists of search results, presented side-by-side to judges. Such preference judgments have been shown to capture a richer set of relevance criteria than traditional methods of collecting relevance labels per single document. However, preference judgments over lists are expensive to obtain and are less reusable, as any change to either side necessitates a new judgment. In this paper, we propose a way to measure the dissimilarity between two sides in side-by-side evaluation experiments and show how this measure can be used to prioritize queries to be judged in an offline setting. Our proposed measure, referred to as Weighted Ranking Difference (WRD), takes into account both the ranking differences and the similarity of the documents across the two sides, where a document may, for example, be a URL or a query suggestion. We empirically evaluate our measure on a large-scale, real-world dataset of crowdsourced preference judgments over ranked lists of auto-completion suggestions. We show that the WRD score is indicative of the probability of tie preference judgments and can, on average, save 25% of the judging resources.

Keywords: side-by-side evaluation; preference judgments; weighted ranking difference measure; query prioritization; judging cost reduction
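The abstract does not give the formal definition of WRD, only that it combines ranking differences with cross-side document similarity. The following is a minimal illustrative sketch of that idea, not the paper's actual formula: the similarity proxy, the rank-based weighting, and the normalization are all assumptions introduced here for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Hypothetical document-similarity proxy (string overlap in [0, 1]).

    The paper's actual similarity function for URLs or query
    suggestions is not specified in the abstract."""
    return SequenceMatcher(None, a, b).ratio()


def weighted_ranking_difference(left: list[str], right: list[str]) -> float:
    """Illustrative rank-difference score weighted by cross-side similarity.

    For each item on the left side, find its best-matching item on the
    right side, then penalize both the rank displacement and the
    dissimilarity of the matched pair. A score of 0 means the two sides
    are effectively identical; higher values mean they differ more.
    This is NOT the WRD definition from the paper, only a sketch of the
    behavior described in the abstract.
    """
    if not left or not right:
        return 1.0
    total, max_total = 0.0, 0.0
    n = max(len(left), len(right))
    for i, doc in enumerate(left):
        weight = 1.0 / (i + 1)  # assumed: differences near the top matter more
        max_total += weight
        # Best-matching position of this document on the other side.
        sims = [text_similarity(doc, other) for other in right]
        j = max(range(len(right)), key=lambda k: sims[k])
        displacement = abs(i - j) / (n - 1) if n > 1 else 0.0
        dissimilarity = 1.0 - sims[j]
        # A document that moved far, or matches poorly, contributes more.
        total += weight * max(displacement, dissimilarity)
    return total / max_total


# Example: two ranked lists of auto-completion suggestions for the same query.
left = ["weather today", "weather tomorrow", "weather radar"]
right = ["weather today", "weather radar", "weather forecast"]
print(f"WRD-like dissimilarity: {weighted_ranking_difference(left, right):.3f}")
```

Under this sketch, query pairs with a low score would be likely to receive tie judgments and could be deprioritized, which is the prioritization use case the abstract describes.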
