Abstract

Evaluation of information retrieval (IR) systems has recently begun to explore the use of preference judgments over two search result lists. Unlike the traditional method of collecting relevance labels for single results, this method allows assessors to consider the interaction between search results as part of the judging criteria. For example, one result list may be preferred over another if it has a more diverse set of relevant results, covering a wider range of user intents. In this paper, we investigate how assessors determine their preference for one list of results over another, with the aim of understanding the role of various relevance dimensions in preference-based evaluation. We run a series of experiments and collect preference judgments over different relevance dimensions in side-by-side comparisons of two search result lists, as well as relevance judgments for the individual documents. Our analysis of the collected judgments reveals that preference judgments combine multiple dimensions of relevance that go beyond the traditional notion of relevance centered on topicality. Measuring performance based on single-document judgments and NDCG aligns well with topicality-based preferences, but shows misalignment with judges' overall preferences, largely due to the diversity dimension. As a judging method, dimensional preference judging is found to lead to improved judgment quality.
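For context on the single-document baseline the abstract contrasts with list-level preferences, the following is a minimal sketch (not taken from the paper) of NDCG computed from per-document graded relevance labels. The exponential gain, log2 discount, relevance scale, and cutoff k are standard-formulation assumptions, not details reported by the authors.

    # Minimal NDCG sketch over per-document graded relevance labels.
    # Gain and discount follow the common formulation; the 0-3 label
    # scale and cutoff k=10 are illustrative assumptions.
    import math

    def dcg(labels, k):
        # labels: graded relevance of the ranked results, top-down
        return sum((2 ** rel - 1) / math.log2(rank + 2)
                   for rank, rel in enumerate(labels[:k]))

    def ndcg(labels, k=10):
        ideal_dcg = dcg(sorted(labels, reverse=True), k)
        return dcg(labels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

    # Example: two hypothetical result lists judged on a 0-3 scale.
    list_a = [3, 2, 0, 1, 0]
    list_b = [2, 2, 2, 1, 1]
    print(ndcg(list_a), ndcg(list_b))

A metric of this form scores each list from its individual document labels only, so it cannot reward list-level properties such as intent diversity, which is the gap between NDCG and overall preferences that the paper's analysis highlights.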
