Abstract

Information Retrieval (IR) is a discipline deeply rooted in evaluation since its inception. Indeed, experimentally measuring and statistically validating the performance of IR systems are the only possible ways to compare systems, understand which are better than others and, ultimately, which are more effective and useful for end-users. Since the seminal paper by Stevens (1946), it has been known that the properties of a measurement scale determine which operations should or should not be performed on values from that scale. For example, Stevens suggested that means and variances can be computed only when working with, at least, interval scales. It was recently shown that the most popular evaluation measures in IR are not interval-scaled. However, so far, there has been little or no investigation in IR of the impact and consequences of departing from scale assumptions. Taken to the extreme, this might even mean that decades of experimental IR research used potentially improper methods, which may have produced results needing further validation. However, it was unclear if and to what extent these findings apply to actual evaluations; this opened a debate in the community, with researchers standing on opposite positions about whether this should be considered an issue (or not) and to what extent. In this paper, we first give an introduction to representational measurement theory, explaining why certain operations and significance tests are permissible only with scales of a certain level. To this end, we introduce the notion of meaningfulness, specifying the conditions under which the truth (or falsity) of a statement is invariant under permissible transformations of a scale. Furthermore, we show how the recall base and the length of the run may make comparison and aggregation across topics problematic. We then propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, and describe an experimental evaluation of the differences between the original measures and the interval-scaled ones. For all the measures considered – namely Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision, and Reciprocal Rank – we observe substantial effects, both on the order of average values and on the outcome of significance tests. For the latter, previously significant differences turn out to be insignificant, while insignificant ones become significant. The effect varies remarkably between the tests considered, but on average we observed a 25% change in the decisions about which systems are significantly different and which are not. These experimental findings further support the idea that measurement scales matter and that departing from their assumptions has an impact. This not only suggests that, to the extent possible, it is better to comply with such assumptions, but it also urges us to clearly indicate when we depart from them and to carefully point out the limitations of the conclusions we draw and the conditions under which they are drawn.

Highlights

  • The basic idea is that real-world objects have attributes which constitute their relevant features and induce a set of relationships among them; the set of objects E together with the relationships RE1, RE2, ... among them comprises the so-called Empirical Relational System (ERS) E = ⟨E, RE1, RE2, ...⟩. We then look for a mapping between the real-world objects E and the numbers N in such a way that the relationships RE1, RE2, ... among the objects match corresponding relationships among the numbers; the set of numbers N together with those relationships comprises the Numerical Relational System (NRS)

  • The fact that Information Retrieval (IR) evaluation measures, apart from Precision, Recall, and Rank-Biased Precision (RBP) with p = 0.5, are not interval scales leads to the general issues with computing means, statistical tests, and meaningfulness discussed in Sections III-B to VIII-C and shown in Examples 3 and 7

  • Ferrante et al. [41] have demonstrated that Precision, Recall, and F-measure are interval scales when you fix the length of the run N and the recall base RB; this is the case for a set of runs on a single topic (see the sketch after this list)
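
To illustrate the last highlight, here is a minimal sketch (our own illustrative code, not taken from the paper) that enumerates the values Precision and Recall can attain on a single topic once the run length N and the recall base RB are fixed; the resulting grids of values are evenly spaced, which is the property underlying the interval-scale behaviour in this setting.

```python
def precision_recall_grid(N, RB):
    """Values Precision@N and Recall can attain on a single topic when the
    run length N and the recall base RB are fixed: with k relevant documents
    retrieved, k ranges over 0..min(N, RB)."""
    ks = range(min(N, RB) + 1)
    precisions = [k / N for k in ks]   # evenly spaced, step 1/N
    recalls = [k / RB for k in ks]     # evenly spaced, step 1/RB
    return precisions, recalls

# Example with illustrative values: run length 5, recall base 3.
p, r = precision_recall_grid(N=5, RB=3)
print(p)  # [0.0, 0.2, 0.4, 0.6]
print(r)  # [0.0, 0.333..., 0.666..., 1.0]
```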

Summary

INTRODUCTION

By virtue or by necessity, Information Retrieval (IR) has always been deeply rooted in experimentation, and evaluation has been a formidable driver of innovation and advancement in the field, as witnessed by the success of the major evaluation initiatives – Text REtrieval Conference (TREC) in the United States [56], Conference and Labs of the Evaluation Forum (CLEF) in Europe [44], and NII Testbeds and Community for Information access Research (NTCIR) in Japan. Among the contributions of this work: we show how the recall base and the length of the run may make averaging across topics (or other forms of aggregate statistics) problematic, at best; we propose a straightforward and powerful approach for turning an evaluation measure into an interval scale, by transforming its values into their rank positions – in this way, we provide a means for improving the meaningfulness and validity of our inferences while still preserving the different user models embedded in the various evaluation measures; and we report an experimental evaluation of the differences between using the original measures and the interval-scaled ones, relying on several TREC collections.
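
As a concrete sketch of the rank-transformation idea (an illustrative implementation under our own assumptions, not the authors' code), the snippet below replaces a raw measure score by its rank position among all values the measure can attain on a topic; the set of achievable values used in the example is a toy set, not derived from a real measure.

```python
def to_rank_scale(value, achievable):
    """Map a raw measure score to its rank position among all values the
    measure can attain on the topic (0 = lowest achievable value)."""
    sorted_vals = sorted(set(achievable))
    # Snap to the closest achievable value to absorb floating-point noise.
    return min(range(len(sorted_vals)), key=lambda i: abs(sorted_vals[i] - value))

# Toy example: suppose a measure can only attain these values on a topic.
achievable = [0.0, 0.05, 0.12, 0.35, 0.60, 1.00]
raw_scores = [0.05, 0.35, 1.00]
print([to_rank_scale(s, achievable) for s in raw_scores])  # [1, 3, 5]
```

By construction, consecutive rank positions are equally spaced, which is what makes the transformed values behave as an interval scale while the ordering induced by the original measure (and hence its user model) is preserved.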

EXPERIMENTAL EVALUATION IN IR
OVERVIEW
AVERAGING ACROSS TOPICS AND CORRELATION
WHY MAY SCALES CHANGE FROM TOPIC TO TOPIC OR FROM RUN LENGTH TO RUN LENGTH?
SUMMARY AND DISCUSSION
TRANSFORMING IR MEASURES TO INTERVAL SCALES
IMPLEMENTATION
LIMITATIONS
EXPERIMENTAL SETUP
We regard the following evaluation measures: Precision, Recall, Average Precision, (Normalized) Discounted Cumulative Gain, Rank-Biased Precision, and Reciprocal Rank.
Findings
CONCLUSION AND FUTURE WORK