Abstract

Evaluation of information retrieval (IR) systems has recently begun to explore the use of preference judgments over two search result lists. Unlike the traditional method of collecting relevance labels for single results, this method allows assessors to consider the interaction between search results as part of the judging criteria. For example, one result list may be preferred over another if it has a more diverse set of relevant results, covering a wider range of user intents. In this paper, we investigate how assessors determine their preference for one list of results over another, with the aim of understanding the role of various relevance dimensions in preference-based evaluation. We run a series of experiments and collect preference judgments over different relevance dimensions in side-by-side comparisons of two search result lists, as well as relevance judgments for the individual documents. Our analysis of the collected judgments reveals that preference judgments combine multiple dimensions of relevance that go beyond the traditional notion of relevance centered on topicality. Measuring performance based on single-document judgments and NDCG aligns well with topicality-based preferences, but shows misalignment with judges' overall preferences, largely due to the diversity dimension. As a judging method, dimensional preference judging is found to lead to improved judgment quality.
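For context on the single-document baseline the abstract contrasts with list-level preferences, the following is a minimal sketch (not taken from the paper) of NDCG computed from per-document graded relevance labels. The exponential gain, log2 discount, relevance scale, and cutoff k are standard-formulation assumptions, not details reported by the authors.

    # Minimal NDCG sketch over per-document graded relevance labels.
    # Gain and discount follow the common formulation; the 0-3 label
    # scale and cutoff k=10 are illustrative assumptions.
    import math

    def dcg(labels, k):
        # labels: graded relevance of the ranked results, top-down
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(labels[:k]))

    def ndcg(labels, k=10):
        ideal_dcg = dcg(sorted(labels, reverse=True), k)
        return dcg(labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Example: two hypothetical result lists judged on a 0-3 scale.
    list_a = [3, 2, 0, 1, 0]
    list_b = [2, 2, 2, 1, 1]
    print(ndcg(list_a), ndcg(list_b))

A metric of this form scores each list from its individual document labels only, so it cannot reward list-level properties such as intent diversity, which is the gap between NDCG and overall preferences that the paper's analysis highlights.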
