Constructing and meta-evaluating state-aware evaluation metrics for interactive search systems
Evaluation metrics such as precision, recall, and normalized discounted cumulative gain have been widely applied in ad hoc retrieval experiments and have facilitated the assessment of system performance on a variety of topics over the past decade. However, their effectiveness in capturing users' in-situ search experience is limited, especially in complex search tasks that trigger interactive search sessions. To address this challenge, the evaluation strategies of search systems need to adapt to users' changing information needs and evaluation criteria. In this work, we adopt a taxonomy of search task states that users go through at different moments and in different scenarios of search sessions, and perform a meta-evaluation of existing metrics to better understand their effectiveness in measuring user satisfaction. We then build models for predicting the task states behind queries from in-session signals, and construct and meta-evaluate new state-aware evaluation metrics. Our analysis and experimental evaluation are performed on two datasets collected from a field study and a laboratory study, respectively. Results demonstrate that the effectiveness of individual evaluation metrics varies across task states, and that task states can be detected from in-session signals. In certain states, our new state-aware metrics reflect in-situ user satisfaction better than the extensive list of widely used measures analyzed in this work. Our findings can inspire the design and meta-evaluation of user-centered adaptive evaluation metrics, and also shed light on the development of state-aware interactive search systems.
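As context for the standard metrics this abstract contrasts against, here is a minimal sketch of how nDCG@k is computed from graded relevance judgments; the function names and example gains are illustrative, not taken from the paper:

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top-k graded relevance gains."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal reordering."""
    ideal_dcg = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance of the top results as returned by a system.
print(ndcg_at_k([3, 2, 0, 1], k=4))  # ~0.985
```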
- Conference Article
30
- 10.1145/3292500.3330981
- Jul 25, 2019
User satisfaction is an important variable in Web search evaluation studies and has received increasing attention in recent years. Many studies regard user satisfaction as the ground truth for designing better evaluation metrics. However, most existing studies focus on designing Cranfield-like evaluation metrics that reflect user satisfaction at the query level. As information needs become more complex, users often need multiple queries and multiple rounds of search interaction to complete a search task (e.g. exploratory search). In those cases, how to characterize the user's satisfaction during a search session remains to be investigated. In this paper, we collect a dataset through a laboratory study in which users complete a set of complex search tasks. With the help of hierarchical linear models (HLM), we reveal how users' query-level and session-level satisfaction are affected by different cognitive effects. A number of interesting findings emerge. At the query level, we find that although the relevance of top-ranked documents has an important impact (primacy effect), the average/maximum perceived usefulness of clicked documents is a much stronger signal of user satisfaction. At the session level, perceived satisfaction for a particular query is also affected by the other queries in the same session (anchor effect or expectation effect). We also find that session-level satisfaction correlates most strongly with the last query in the session (recency effect). These findings will help us design better session-level user behavior models and corresponding evaluation metrics.
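A hierarchical (mixed-effects) linear model of the kind described can be fit, for example, with statsmodels. This is a minimal sketch on synthetic stand-in data; the column names (satisfaction, top_relevance, max_usefulness, session_id) are illustrative and not from the paper:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400  # one row per query
# Stand-in data mimicking the two-level (queries within sessions) design.
df = pd.DataFrame({
    "session_id": rng.integers(0, 80, n),
    "top_relevance": rng.random(n),
    "max_usefulness": rng.random(n),
})
df["satisfaction"] = (0.5 * df["top_relevance"]
                      + 2.0 * df["max_usefulness"]
                      + rng.normal(0, 0.3, n))

# A random intercept per session captures session-level variation,
# separating it from query-level effects.
model = smf.mixedlm("satisfaction ~ top_relevance + max_usefulness",
                    data=df, groups=df["session_id"])
print(model.fit().summary())
```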
- Conference Article
16
- 10.1145/3459637.3482190
- Oct 26, 2021
In interactive IR (IIR), users often seek to achieve different goals (e.g. exploring a new topic, finding a specific known item) at different search iterations and thus may evaluate system performance differently. Without a state-aware approach, it would be extremely difficult to simulate and achieve real-time adaptive search evaluation and recommendation. To address this gap, our work identifies users' task states from interactive search sessions and meta-evaluates a series of online and offline evaluation metrics under varying states, based on a user study dataset consisting of 1548 unique query segments from 450 search sessions. Our results indicate that: 1) users' individual task states can be identified and predicted from search behaviors and implicit feedback; 2) the effectiveness of mainstream evaluation measures (measured by their respective correlations with user satisfaction) varies significantly across task states. This study demonstrates the implicit heterogeneity in user-oriented IR evaluation and connects studies on complex search tasks with evaluation techniques. It also informs future research on the design of state-specific, adaptive user models and evaluation metrics.
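Predicting task states from behavioral signals, as described here, is a standard supervised classification setup. A minimal sketch with scikit-learn, using random stand-in features and labels in place of the real behavioral signals (dwell time, clicks, reformulation type, etc.):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Stand-in feature matrix: one row per query segment (1548, matching the
# abstract), with columns standing in for in-session behavioral signals.
X = rng.random((1548, 4))
# Stand-in labels: one task state per query segment (4 states assumed).
y = rng.integers(0, 4, size=1548)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
# With random data this hovers at chance; real features would be informative.
print("mean CV accuracy:", scores.mean())
```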
- Conference Article
27
- 10.1145/3397271.3401162
- Jul 25, 2020
Evaluation metrics play an important role in the batch evaluation of IR systems. Based on a user model that describes how users interact with the ranked list, an evaluation metric is defined to link the relevance scores of a list of documents to an estimate of system effectiveness and user satisfaction. Therefore, the validity of an evaluation metric has two facets: whether the underlying user model can accurately predict user behavior, and whether the evaluation metric correlates well with user satisfaction. While a tremendous amount of work has been undertaken to design, evaluate, and compare different evaluation metrics, few studies have explored the consistency between these two facets. Specifically, we want to investigate whether metrics that are well calibrated with user behavior data perform equally well in estimating user satisfaction. To shed light on this research question, we compare the performance of various metrics within the C/W/L framework in estimating user satisfaction when they are optimized to fit observed user behavior. Experimental results on both self-collected and publicly available user search behavior datasets show that metrics optimized to fit users' click behavior can perform as well as those calibrated with user satisfaction feedback. We also investigate the reliability of the calibration process to find out how much data is required for parameter tuning. Our findings provide empirical support for the consistency between user behavior modeling and satisfaction measurement, as well as guidance for tuning the parameters of evaluation metrics.
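The C/W/L framework mentioned here expresses a metric through per-rank continuation probabilities C(i), from which examination weights W(i) and stopping probabilities L(i) follow. A minimal sketch of that bookkeeping, assuming a constant continuation probability (an RBP-style user model); names and values are illustrative:

```python
import numpy as np

def cwl_expected_rate_of_gain(relevances, continuation):
    """From per-rank continuation probabilities C(i), derive weights W(i),
    stopping probabilities L(i), and the expected rate of gain sum W(i)*r(i)."""
    r = np.asarray(relevances, dtype=float)
    c = np.asarray(continuation, dtype=float)
    # Probability of reaching rank i: product of C over all earlier ranks.
    reach = np.concatenate(([1.0], np.cumprod(c[:-1])))
    weights = reach / reach.sum()       # W(i)
    stopping = weights * (1.0 - c)      # L(i)
    return float(np.dot(weights, r)), stopping

# RBP-like user with constant continuation probability 0.8.
erg, stop_dist = cwl_expected_rate_of_gain([1, 0, 1, 0, 0], [0.8] * 5)
print(round(erg, 3))
```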
- Research Article
69
- 10.1109/tmm.2020.2980944
- Mar 23, 2020
- IEEE Transactions on Multimedia
Despite the fact that automatic content analysis has made remarkable progress over the last decade - mainly due to significant advances in machine learning - interactive video retrieval is still a very challenging problem, with increasing relevance in practical applications. The Video Browser Showdown (VBS) is an annual evaluation competition that pushes the limits of interactive video retrieval with state-of-the-art tools, tasks, data, and evaluation metrics. In this paper, we analyse the results and outcome of the 8th iteration of the VBS in detail. We first give an overview of the novel and considerably larger V3C1 dataset and the tasks that were performed during VBS 2019. We then describe the search systems of the six international teams in terms of features and performance. Finally, we perform an in-depth analysis of the per-team success ratio and relate it to the search strategies that were applied, the most popular features, and the problems that were experienced. A large part of this analysis was conducted on logs collected during the competition itself, giving further insights into typical search behavior and the differences between expert and novice users. Our evaluation shows that textual search and content browsing are the most important aspects in terms of logged user interactions. Furthermore, we observe a trend towards deep-learning-based features, especially in the form of labels generated by artificial neural networks. Nevertheless, for some tasks, very specific content-based search features are still being used. We expect these findings to contribute to future improvements of interactive video search systems.
- Conference Article
36
- 10.1145/3077136.3080841
- Aug 7, 2017
The design of a Web search evaluation metric is closely related to how the user's interaction process is modeled. Each behavioral model results in a different metric for evaluating search performance. In these models and the user behavior assumptions behind them, when a user ends a search session is one of the prime concerns, because it is highly related to both benefit and cost estimation. Existing metric designs usually adopt simplified criteria to decide the stopping point: (1) an upper limit on benefit (e.g. RR, AP); (2) an upper limit on cost (e.g. Precision@N, DCG@N). However, in many practical search sessions (e.g. exploratory search), the stopping criterion is more complex than these simplified cases. Analyzing the benefit and cost of actual users' search sessions, we find that stopping criteria vary with search tasks and are usually a combined effect of both benefit and cost factors. Inspired by a popular computer game named Bejeweled, we propose a Bejeweled Player Model (BPM) to simulate users' search interaction processes and evaluate their search performance. In the BPM, a user stops when he/she either has found sufficient useful information or has no more patience to continue. Given this assumption, a new evaluation framework based on upper limits (either fixed or changing as the search proceeds) for both benefit and cost is proposed. We show how to derive a new metric from the framework and demonstrate that it can be adopted to revise traditional metrics such as Discounted Cumulative Gain (DCG), Expected Reciprocal Rank (ERR), and Average Precision (AP). To show the effectiveness of the proposed framework, we compare it with a number of existing metrics in terms of their correlation with user satisfaction, based on a dataset that collects users' explicit satisfaction feedback and assessors' relevance judgments. Experimental results show that the framework correlates better with user satisfaction feedback.
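The BPM's core assumption, stopping once either a benefit limit or a cost (patience) limit is reached, is simple to express directly. A minimal sketch of such a stopping rule; gains, costs, and limits are illustrative inputs, not the paper's parameterization:

```python
def bpm_stop_rank(gains, costs, benefit_limit, cost_limit):
    """BPM-style stopping rule: the simulated user scans down the ranking
    and stops once accumulated benefit reaches the benefit limit or
    accumulated cost exhausts the patience (cost) limit."""
    benefit = cost = 0.0
    for rank, (g, c) in enumerate(zip(gains, costs), start=1):
        benefit += g
        cost += c
        if benefit >= benefit_limit or cost >= cost_limit:
            return rank
    return len(gains)  # user examined the whole list

# Stops at rank 3: accumulated benefit 0.4 + 0.0 + 0.7 crosses the limit 1.0.
print(bpm_stop_rank([0.4, 0.0, 0.7, 0.2], [1, 1, 1, 1],
                    benefit_limit=1.0, cost_limit=10))
```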
- Conference Article
2
- 10.1145/3372278.3390726
- Jun 8, 2020
Searching for memorized images in large datasets (known-item search) is a challenging task, due to the limited effectiveness of retrieval models as well as the limited ability of users to formulate suitable queries and choose an appropriate search strategy. A popular way to approach the task is to automatically detect semantic concepts and rely on interactive specification of keywords during the search session. Nonetheless, the instances of such search models employed in existing KIS systems are often configured arbitrarily, as comprehensive evaluations with real users are time-demanding. This paper envisions and investigates an option to simulate keyword queries in a selected "toy" (yet competitive) keyword search model relying on a deep image classification network. Specifically, two properties of such a keyword-based model are experimentally investigated with our known-item search benchmark dataset: which output transformation and ranking models are effective for the utilized classification model, and whether there are options for simulating keyword queries. In addition to the main objective, the paper also inspects the effect of interactive query reformulations for the considered keyword search model.
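One simple way to simulate a keyword query against classifier outputs, in the spirit of what the abstract describes: take the labels that score highest on the target image as the "remembered" keywords and rank the collection by those labels. Everything below (scores, vocabulary size, aggregation by sum) is a stand-in assumption, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in classifier output: per-image confidence over a label vocabulary.
n_images, n_labels = 1000, 50
label_scores = rng.random((n_images, n_labels))
target = 42  # index of the memorized target image

# Simulated keyword query: the k labels scoring highest on the target,
# mimicking a user typing the concepts they remember seeing.
k = 3
query_labels = np.argsort(label_scores[target])[-k:]

# Rank all images by their summed score on the queried labels.
ranking = np.argsort(-label_scores[:, query_labels].sum(axis=1))
print("target rank:", int(np.where(ranking == target)[0][0]) + 1)
```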
- Research Article
11
- 10.1016/j.aiopen.2021.02.003
- Jan 1, 2020
- AI Open
User behavior modeling for Web search evaluation
- Conference Article
22
- 10.1145/3209978.3210059
- Jun 27, 2018
Compared with general Web search engines, image search engines present search results differently, using a two-dimensional visual image panel that users scroll and browse quickly. These differences in result presentation can significantly impact the way users interact with search engines, and therefore affect existing methods of search evaluation. Although different evaluation metrics have been thoroughly studied in the general Web search environment, how those offline and online metrics reflect user satisfaction in the context of image search remains an open question. To shed light on this, we conduct a laboratory user study that collects both explicit user satisfaction feedback and user behavior signals such as clicks. We find that offline image search metrics based on a combination of externally assessed topical relevance and image quality judgments correlate better with user satisfaction than those using topical relevance alone. We also demonstrate that existing offline Web search metrics can be adapted to the two-dimensional presentation of image search results. With respect to online metrics, we find that those based on image click information significantly outperform offline metrics. To our knowledge, our work is the first to thoroughly establish the relationship between different measures and user satisfaction in image search.
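One conceivable way to adapt an offline metric like DCG to a two-dimensional panel is to flatten the grid under a reading-order assumption. This sketch assumes a row-major scan, which is only one possible assumption and not necessarily the adaptation used in the paper:

```python
import math

def grid_dcg(grid_gains):
    """Adapt DCG to a 2D image result panel by flattening the grid in
    row-major order, assuming users scan row by row (an illustrative
    reading-order assumption)."""
    flat = [g for row in grid_gains for g in row]
    return sum(g / math.log2(i + 2) for i, g in enumerate(flat))

# A 2x3 panel of graded judgments combining relevance and image quality.
print(round(grid_dcg([[2, 1, 0],
                      [1, 0, 2]]), 3))
```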
- Conference Article
13
- 10.1145/3209978.3210097
- Jun 27, 2018
User satisfaction has received much attention in recent Web search evaluation studies and is regarded as the ground truth for designing better evaluation metrics. However, most existing studies focus on the relationship between satisfaction and evaluation metrics at the query level. As search requests become increasingly complex, there are many scenarios in which multiple queries and multiple rounds of search interaction are needed (e.g. exploratory search). In those cases, the relationship between session-level search satisfaction and session search evaluation metrics remains uninvestigated. In this paper, we analyze how users' perceptions of satisfaction accord with a series of session-level evaluation metrics. We conduct a laboratory study in which users are required to finish complex search tasks and provide usefulness judgments of documents as well as session-level and query-level satisfaction feedback. We test a number of popular session search evaluation metrics as well as different weighting functions. Experimental results show that query-level satisfaction is mainly decided by the clicked document that users consider most useful (maximum effect), while session-level satisfaction is highly correlated with the most recently issued queries (recency effect). We further propose a number of criteria for designing better session search evaluation metrics.
- Conference Article
17
- 10.1145/3397271.3401163
- Jul 25, 2020
Session search evaluation has recently received more attention, as realistic search scenarios usually involve multiple queries and interactions between users and systems. Evolved from model-based evaluation metrics for a single query, existing session-based metrics also follow a generic framework based on the cascade hypothesis. The cascade hypothesis assumes that lower-ranked search results and later-issued queries receive less attention from users and should therefore be assigned smaller weights when calculating evaluation metrics. This hypothesis has proved successful in modeling search users' behavior and designing evaluation metrics, by explaining why users' attention decays on search engine result pages. However, recent studies have found that the recency effect also plays an important role in determining user satisfaction in search sessions. In particular, whether a user feels satisfied with the later-issued queries heavily influences his/her search satisfaction over the whole session. To incorporate both the cascade hypothesis and the recency effect into the design of session search evaluation metrics, we propose Recency-aware Session-based Metrics (RSMs), which simultaneously characterize users' examination process with a browsing model and their cognitive process with a utility accumulation model. With both self-constructed and publicly available user search behavior datasets, we show the effectiveness of the proposed RSMs by comparing them with existing session-based metrics in light of their correlation with user satisfaction. We also find that the influence of the cascade and recency effects varies dramatically among tasks with different difficulties and complexities, which suggests that different model parameters should be used for different types of search tasks. Our findings highlight the importance of investigating and utilizing cognitive effects beyond examination hypotheses in search evaluation.
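The two ingredients described, a cascade-style rank discount within each query and recency-weighted aggregation across queries, can be combined as follows. This is only a sketch of the general idea; the paper's actual RSM definition and parameters (gamma, eta here are illustrative) may differ:

```python
def recency_session_metric(session, gamma=0.8, eta=0.9):
    """Recency-aware session score: a cascade-style rank discount inside
    each query, and query weights that grow toward the end of the session."""
    n = len(session)
    score = norm = 0.0
    for q, gains in enumerate(session):
        # Within-query cascade discount: attention decays down the ranking.
        query_score = sum(g * (gamma ** rank) for rank, g in enumerate(gains))
        # Across-query recency weight: later queries weigh more.
        weight = eta ** (n - 1 - q)
        score += weight * query_score
        norm += weight
    return score / norm if norm else 0.0

# Three-query session; the last query contributes most to the score.
print(round(recency_session_metric([[1, 0, 0], [0, 1, 0], [1, 1, 0]]), 3))
```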
- Conference Article
94
- 10.1145/2600428.2609629
- Jul 3, 2014
Session search is a complex search task that involves multiple search iterations triggered by query reformulations. We observe a Markov chain in session search: a user's judgment of the retrieved documents in the previous search iteration affects the user's actions in the next iteration. We therefore propose to model session search as a dual-agent stochastic game: the user agent and the search engine agent work together to jointly maximize their long-term rewards. The framework, which we term win-win, is based on a Partially Observable Markov Decision Process. We mathematically model the dynamics of session search, including decision states, query changes, clicks, and rewards, as a cooperative game between the user and the search engine. Experiments on the TREC 2012 and 2013 Session datasets show a statistically significant improvement over state-of-the-art interactive search and session search algorithms.
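The paper's formulation is a dual-agent stochastic game over a POMDP; as a much-simplified single-agent illustration of learning values over (state, query-change action) pairs, here is a generic tabular Q-learning sketch. Every name, state, action, and reward below is a stand-in, not the win-win model:

```python
import random
from collections import defaultdict

random.seed(0)
# Stand-in states (coarse session conditions) and actions (query changes).
states = ["exploring", "exploiting"]
actions = ["add_term", "remove_term", "keep_terms"]
Q = defaultdict(float)
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def environment(state, action):
    """Toy environment: reward stands in for relevance gained after a
    reformulation; transitions are random purely for illustration."""
    reward = random.random() * (1.5 if action == "add_term" else 1.0)
    return random.choice(states), reward

state = "exploring"
for _ in range(5000):
    if random.random() < epsilon:
        action = random.choice(actions)                     # explore
    else:
        action = max(actions, key=lambda a: Q[(state, a)])  # exploit
    next_state, reward = environment(state, action)
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
    state = next_state

print({k: round(v, 2) for k, v in Q.items()})
```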
- Conference Article
47
- 10.1145/3077136.3080804
- Aug 7, 2017
As in most information retrieval (IR) studies, evaluation plays an essential part in Web search research. Both offline and online evaluation metrics are adopted to measure the performance of search engines. Offline metrics are usually based on relevance judgments of query-document pairs from assessors, while online metrics exploit user behavior data, such as clicks, collected from search engines to compare search algorithms. Although both types of IR evaluation metrics have achieved success, to what extent they can predict user satisfaction remains under-investigated. To shed light on this research question, we meta-evaluate a series of existing online and offline metrics to study how well they infer actual search user satisfaction in different search scenarios. We find that both types of evaluation metrics correlate significantly with user satisfaction, but they reflect satisfaction from different perspectives for different search tasks. Offline metrics align better with user satisfaction in homogeneous search (i.e. ten blue links), whereas online metrics outperform them when vertical results are federated. Finally, we also propose to incorporate mouse hover information into existing online evaluation metrics, and empirically show that these align better with search user satisfaction than click-based online metrics.
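The meta-evaluation step described here, measuring how well a metric infers satisfaction, typically reduces to correlating per-query metric scores against satisfaction ratings. A minimal sketch on synthetic stand-in data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
# Stand-in data: one metric score and one satisfaction rating per query.
metric_scores = rng.random(200)
satisfaction = 5 * metric_scores + rng.normal(0, 1, 200)  # noisy ratings

r, p = pearsonr(metric_scores, satisfaction)
rho, p_rho = spearmanr(metric_scores, satisfaction)
print(f"Pearson r={r:.3f} (p={p:.3g}), Spearman rho={rho:.3f}")
```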
- Conference Article
26
- 10.1145/3178876.3186065
- Jan 1, 2018
User satisfaction has received much attention in recent Web search evaluation studies. Although satisfaction is often considered an important sign of search success, it does not guarantee success in many cases, especially in complex search task scenarios. In this study, we investigate the differences between user satisfaction and search success, and adopt the findings to predict search success in complex search tasks. To achieve these research goals, we conduct a laboratory study in which search success and user satisfaction are annotated by domain expert assessors and search users, respectively. We find that both Satisfaction with Failure and Unsatisfied Success cases occur in these search tasks, and together they account for as many as 40.3% of all search sessions. The factors (e.g. document readability and credibility) that lead to the inconsistency between search success and user satisfaction are also investigated and adopted to predict whether a search task is successful. Experimental results show that our proposed prediction method is effective in predicting search success.
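The Satisfaction-with-Failure and Unsatisfied-Success cases discussed here are simply the off-diagonal cells of a success-by-satisfaction cross-tabulation. A minimal sketch with pandas; the annotations and column names are stand-ins:

```python
import pandas as pd

# Stand-in annotations: expert-judged success and user-reported
# satisfaction for each session.
df = pd.DataFrame({
    "success":   [1, 1, 0, 0, 1, 0, 1, 0],
    "satisfied": [1, 0, 1, 0, 1, 1, 1, 0],
})
# Off-diagonal cells (success=1/satisfied=0 and success=0/satisfied=1)
# are the Unsatisfied-Success and Satisfaction-with-Failure cases.
print(pd.crosstab(df["success"], df["satisfied"], normalize="all"))
```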
- Research Article
40
- 10.1145/3445029
- Sep 1, 2021
- ACM Transactions on Information Systems
Conversational search systems, such as Google Assistant and Microsoft Cortana, enable users to interact with search systems over multiple rounds of natural language dialogue. Evaluating such systems is very challenging, given that any natural language response could be generated and that users commonly interact for multiple semantically coherent rounds to accomplish a search task. Although prior studies have proposed many evaluation metrics, the extent to which those measures effectively capture user preference remains to be investigated. In this article, we systematically meta-evaluate a variety of conversational search metrics. We specifically study three perspectives on those metrics: (1) reliability: the ability to detect “actual” performance differences as opposed to those observed by chance; (2) fidelity: the ability to agree with ultimate user preference; and (3) intuitiveness: the ability to capture any property deemed important: adequacy, informativeness, and fluency in the context of conversational search. By conducting experiments on two test collections, we find that the performance of different metrics varies significantly across scenarios, and, consistent with prior studies, existing metrics achieve only weak correlation with ultimate user preference and satisfaction. METEOR is, comparatively speaking, the best existing single-turn metric considering all three perspectives. We also demonstrate that adapted session-based evaluation metrics can be used to measure multi-turn conversational search, achieving moderate concordance with user satisfaction. To our knowledge, our work establishes the most comprehensive meta-evaluation for conversational search to date.
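For reference, the single-turn METEOR score highlighted in the abstract can be computed with NLTK's implementation, which expects pre-tokenized input and WordNet data. The example sentences are illustrative:

```python
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.translate.meteor_score import meteor_score

reference = "the louvre is open until 6 pm today".split()
response = "the louvre museum is open until 6 pm".split()

# METEOR between one system response and one reference answer.
print(round(meteor_score([reference], response), 3))
```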
- Research Article
568
- 10.1145/1059981.1059982
- Apr 1, 2005
- ACM Transactions on Information Systems
Of growing interest in the area of improving the search experience is the collection of implicit user behavior measures (implicit measures) as indications of user interest and user satisfaction. Rather than having users submit explicit feedback, which can be costly in time and resources and can alter the pattern of use within the search experience, some research has explored the collection of implicit measures as an efficient and useful alternative to collecting explicit measures of interest from users. This research article describes a recent study with two main objectives. The first was to test whether there is an association between explicit ratings of user satisfaction and implicit measures of user interest. The second was to understand which implicit measures were most strongly associated with user satisfaction. The domain of interest was Web search. We developed an instrumented browser to collect a variety of measures of user activity and to ask for explicit judgments of the relevance of individual pages visited and entire search sessions. The data was collected in a workplace setting to improve the generalizability of the results. Results were analyzed using traditional methods (e.g., Bayesian modeling and decision trees) as well as a new usage behavior pattern analysis (“gene analysis”). We found that there was an association between implicit measures of user activity and users' explicit satisfaction ratings. The best models for individual pages combined clickthrough, time spent on the search result page, and how a user exited a result or ended a search session (exit type/end action). Behavioral patterns (through the gene analysis) can also be used to predict user satisfaction for search sessions.
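The abstract mentions decision trees over implicit measures such as clickthrough, dwell time, and exit type. A minimal sketch of that style of model with scikit-learn, on synthetic stand-in data (the features and their relationship to satisfaction are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
# Stand-in implicit measures per result-page visit:
# clickthrough count, dwell time, exit-type code.
X = rng.random((500, 3))
# Stand-in satisfaction labels loosely driven by clicks and dwell time.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.2, 500) > 0.9).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree,
                  feature_names=["clickthrough", "dwell_time", "exit_type"]))
```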