Abstract
While neural models have been shown to exhibit strong performance on single-turn visual question answering (VQA) tasks, extending VQA to a multi-turn, conversational setting remains a challenge. One way to address this challenge is to augment existing strong neural VQA models with mechanisms that allow them to retain information from previous dialog turns. One such strong VQA model is the MAC network, which decomposes a task into a series of attention-based reasoning steps. However, because the MAC network is designed for single-turn question answering, it cannot refer to past dialog turns; in particular, it struggles with tasks that require reasoning over the dialog history, such as coreference resolution. We extend the MAC network architecture with Context-aware Attention and Memory (CAM), which attends over control states in past dialog turns to determine the necessary reasoning operations for the current question. MAC networks with CAM achieve up to 98.25% accuracy on the CLEVR-Dialog dataset, beating the existing state of the art by 30% (absolute). Our error analysis indicates that with CAM, the model improves particularly on questions that require coreference resolution.
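To make the mechanism concrete, below is a minimal sketch, in PyTorch-style Python, of what attention over past control states could look like. All module and variable names (ContextAwareAttention, query_proj, fuse, and so on) are our own illustrative assumptions, not the authors' implementation: the current control state queries the control states cached from earlier turns, and the attended summary is fused back into the current control state to guide the reasoning operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareAttention(nn.Module):
    """Hypothetical sketch of CAM-style attention over past control states.

    Given the current MAC control state and a cache of control states from
    previous dialog turns, compute an attention-weighted summary of the
    history and fuse it with the current control state.
    """

    def __init__(self, ctrl_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(ctrl_dim, ctrl_dim)  # projects current control
        self.key_proj = nn.Linear(ctrl_dim, ctrl_dim)    # projects past controls
        self.fuse = nn.Linear(2 * ctrl_dim, ctrl_dim)    # merges current + context

    def forward(self, control: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # control: (batch, ctrl_dim) -- control state at the current reasoning step
        # history: (batch, n_past, ctrl_dim) -- control states from past dialog turns
        q = self.query_proj(control).unsqueeze(1)        # (batch, 1, ctrl_dim)
        k = self.key_proj(history)                       # (batch, n_past, ctrl_dim)
        scores = (q * k).sum(-1)                         # (batch, n_past) dot-product scores
        attn = F.softmax(scores, dim=-1)                 # attention over past turns
        context = (attn.unsqueeze(-1) * history).sum(1)  # (batch, ctrl_dim) history summary
        # Condition the current control state on the attended history.
        return self.fuse(torch.cat([control, context], dim=-1))
```

In a full model, the history cache would be extended after each turn with that turn's control states; a scaled dot-product or MLP scorer would serve equally well as the attention function here.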
Highlights
Visual dialog is the task of answering a sequence of questions about a given image such that responding to any one question in the dialog requires context from the previous dialog history
We experiment with three different combinations of our dialog-specific extensions to the MAC network architecture: (i) context-aware attention over control states, (ii) multi-turn memory, and (iii) concatenating the dialog history as input to MAC, an obvious but naive and inefficient strategy for incorporating contextual information into a single-turn QA model (see the sketch after these highlights)
We see that vanilla MAC, despite being history-agnostic, achieves 10% higher accuracy than Neural Module Networks (NMNs) and comes surprisingly close to CorefNMN, which explicitly reasons over the dialog history
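As a point of contrast for strategy (iii) above, the naive baseline can be sketched in a few lines: the dialog history is simply flattened into the question string before encoding. The helper below is hypothetical (not from the paper) and illustrates why the approach is inefficient: the encoder's input grows linearly with the number of turns.

```python
def build_history_input(history, question, sep=" <SEP> "):
    """Hypothetical helper: flatten prior (question, answer) turns into one string.

    history:  list of (question, answer) string pairs from earlier turns.
    question: the current question string.
    The result is fed to the question encoder as-is; its length grows
    linearly with dialog depth, which is what makes this strategy inefficient.
    """
    turns = [f"{q} {a}" for q, a in history]
    return sep.join(turns + [question])

# Example (illustrative dialog, not drawn from CLEVR-Dialog):
history = [("How many cubes are there?", "3"),
           ("What color is the large one?", "red")]
print(build_history_input(history, "Is it the same color as the sphere?"))
# How many cubes are there? 3 <SEP> What color is the large one? red <SEP>
# Is it the same color as the sphere?
```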
Summary
Visual dialog is the task of answering a sequence of questions about a given image such that responding to any one question in the dialog requires context from the previous dialog history. Visual coreference resolution demands both the ability to reason over coreferences in the dialog and the ability to ground entities from the language modality in the visual one. In contrast to large-scale realistic datasets for visual dialog, such as VisDial (Das et al., 2017b), Kottur et al. (2019) introduce CLEVR-Dialog as a diagnostic dataset: unlike other visual dialog datasets, it is synthetically generated, which allows it to be both large-scale and structured in nature. The structured nature of its images and language enables fine-grained analysis, letting researchers study the different components in isolation and identify bottlenecks in end-to-end systems for visual dialog