Abstract

Assessing an AI agent that can converse in human language and understand visual content is challenging. Generation metrics, such as the BLEU score, favor correct syntax over semantics. Hence a discriminative approach is often used, where an agent ranks a set of candidate answers. The mean reciprocal rank (MRR) metric evaluates model performance based on the rank of a single human-derived answer. This approach, however, raises a new challenge: answer ambiguity and synonymy, e.g., the semantic equivalence of ‘yeah’ and ‘yes’. To address this, the normalized discounted cumulative gain (NDCG) metric has been used to capture the relevance of all correct answers via dense annotations. However, the NDCG metric favors generic, uncertain answers that are usually applicable, such as ‘I don’t know.’ Crafting a model that excels on both the MRR and NDCG metrics is challenging. Ideally, an AI agent should give a human-like reply while also acknowledging the correctness of any relevant answer. To address this issue, we describe a two-step non-parametric ranking approach that can merge strong MRR and NDCG models. Using our approach, we retain most of the state-of-the-art MRR performance (70.41% vs. 71.24%) and NDCG performance (72.16% vs. 75.35%). Moreover, our approach won the recent Visual Dialog 2020 challenge. Source code is available at https://github.com/idansc/mrr-ndcg.
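To make the tension between the two metrics concrete, below is a minimal Python sketch of how MRR and NDCG score a single ranked list of answer candidates. The candidate strings, relevance values, and the simplification of scoring the full list (the official Visual Dialog NDCG truncates to the relevant candidates) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def reciprocal_rank(ranked_candidates, gt_answer):
    """Reciprocal rank of the single human-derived (ground-truth) answer."""
    rank = ranked_candidates.index(gt_answer) + 1  # ranks are 1-based
    return 1.0 / rank

def ndcg(ranked_candidates, relevance):
    """Simplified NDCG: discounted gain of dense relevance annotations,
    normalized by the gain of an ideal, relevance-sorted ranking."""
    gains = np.array([relevance.get(a, 0.0) for a in ranked_candidates])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float((gains * discounts).sum())
    ideal_dcg = float((np.sort(gains)[::-1] * discounts).sum())
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: 'yes' is the human-derived answer, 'yeah' is equally relevant
# under dense annotations, and 'i don't know' is partially relevant.
candidates = ["i don't know", "yeah", "yes", "no"]
relevance = {"yes": 1.0, "yeah": 1.0, "i don't know": 0.5}

print(reciprocal_rank(candidates, "yes"))  # 0.33: MRR only rewards the single human answer
print(ndcg(candidates, relevance))         # ~0.87: NDCG rewards every relevant answer
```

The same ranking thus scores poorly on MRR while scoring well on NDCG, which is exactly the gap the two-step merge is designed to close.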

Highlights


  • Prior works focus on optimizing a single metric (Guo et al, 2019; Jiang et al, 2020; Hu et al). In contrast, we describe two steps: (i) the mean reciprocal rank (MRR) step, responsible for keeping the human-derived answer ranked high, and (ii) the normalized discounted cumulative gain (NDCG) step, responsible for ranking the remaining candidates when the MRR model is not certain

  • When the NDCG model and the MRR model agree that a candidate is likely to be correct, it implies that both the NDCG and MRR metrics gain by ranking this candidate high


Summary

Related Work

Visual conversation evaluation: Early attempts to marry conversation with vision used street-scene images and binary questions (Geman et al, 2015). A different approach, suggested by the VQA dataset, focuses on brief, mostly one-word answers (Antol et al, 2015). Schwartz et al (2019b) propose Factor Graph Attention (FGA), a model that lets all entities (e.g., question words, image regions, answer candidates, and caption words) interact to infer an attention map for each modality. Murahari et al (2020) recently propose the Large-Scale (LS) model, which pre-trains on related vision-language datasets, e.g., Conceptual Captions and Visual Question Answering (Sharma et al, 2018; Antol et al, 2015). Prior works focus on optimizing a single metric (Guo et al, 2019; Jiang et al, 2020; Hu et al). In the following, we describe two steps: (i) the MRR step, responsible for keeping the human-derived answer ranked high, and (ii) the NDCG step, responsible for ranking the remaining candidates when the MRR model is not certain. We add the MRR model's answer at the first retrieval position.
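The merge is described here only at a high level, so the following Python sketch is just one plausible reading of the two steps rather than the authors' exact procedure; the per-candidate score dictionaries, the certainty threshold, and the top-k agreement rule are all illustrative assumptions.

```python
def merge_rankings(candidates, mrr_scores, ndcg_scores,
                   certainty_threshold=0.5, agreement_k=5):
    """Two-step, non-parametric merge of an MRR model and an NDCG model.

    Step (i), the MRR step: pin the MRR model's top answer at rank 1 so the
    human-derived answer stays highly ranked.
    Step (ii), the NDCG step: order the remaining candidates with the NDCG
    model, preferring candidates that both models place in their top-k.
    """
    by_mrr = sorted(candidates, key=lambda c: -mrr_scores[c])
    by_ndcg = sorted(candidates, key=lambda c: -ndcg_scores[c])
    top_answer = by_mrr[0]  # MRR step: occupy the first retrieval slot

    rest = [c for c in candidates if c != top_answer]
    if mrr_scores[top_answer] >= certainty_threshold:
        # The MRR model is certain: candidates both models rank highly
        # ("NDCG-agreement answers") come first, then the NDCG ordering.
        agreement = set(by_mrr[:agreement_k]) & set(by_ndcg[:agreement_k])
        rest.sort(key=lambda c: (c not in agreement, -ndcg_scores[c]))
    else:
        # The MRR model is not certain: fall back entirely to the NDCG model.
        rest.sort(key=lambda c: -ndcg_scores[c])

    return [top_answer] + rest
```

Calling merge_rankings(candidates, mrr_scores, ndcg_scores) with scores from the two pre-trained models yields a single fused ranking that can then be evaluated against both metrics.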

Top answers
NDCG-agreement answers
High-certainty answers
NDCG step
Conclusions
A Qualitative Analysis
Findings
