Abstract

Rephrasings, or paraphrases, are sentences that express similar meanings in different ways. Visual Question Answering (VQA) models are closing the gap to oracle performance on datasets like VQA2.0. However, these models fail to perform well on rephrasings of a question, which raises important questions: are these models robust to linguistic variations? Is it the architecture or the dataset that we need to optimize? In this paper, we analyze VQA models in the space of paraphrasing. We explore the role of language and cross-modal pre-training to investigate the robustness of VQA models to lexical variations. Our experiments find that pre-trained language encoders generate efficient representations of question rephrasings, which help VQA models correctly infer these samples. We empirically determine why pre-training language encoders improves lexical robustness. Finally, we observe that although pre-training all VQA components obtains state-of-the-art results on the VQA-Rephrasings dataset, it still fails to completely close the performance gap between the original and rephrasing validation splits.
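
Concretely, the gap referred to above can be measured by scoring a model separately on the original questions and on their rephrasings. The sketch below is a minimal illustration under stated assumptions, not the paper's evaluation code: predictions and ground truth are hypothetical dictionaries keyed by question id, and exact-match accuracy stands in for the standard multi-annotator VQA accuracy.

```python
# Minimal sketch (not the paper's released evaluation code) of the
# original-vs-rephrasing gap. `preds` and `gt` are hypothetical dicts
# keyed by question id.

def split_accuracy(preds, gt):
    """Exact-match accuracy over a split (a simplification of VQA accuracy)."""
    return sum(preds[q] == gt[q] for q in preds) / max(len(preds), 1)

def robustness_gap(orig_preds, orig_gt, reph_preds, reph_gt):
    """Accuracy on original questions, on their rephrasings, and the gap."""
    acc_orig = split_accuracy(orig_preds, orig_gt)
    acc_reph = split_accuracy(reph_preds, reph_gt)
    return acc_orig, acc_reph, acc_orig - acc_reph

def group_consistency(reph_preds, reph_gt, groups):
    """Fraction of rephrasing groups in which every rephrasing is answered correctly.

    `groups` maps an original question id to the ids of its rephrasings.
    """
    ok = [all(reph_preds[q] == reph_gt[q] for q in qids) for qids in groups.values()]
    return sum(ok) / max(len(ok), 1)
```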

Highlights

  • Visual Question Answering (VQA) (Antol et al., 2015) is an image-conditioned question answering task that has gained immense popularity in the vision & language community

  • Most recent gains have come from semantically rich visual features (Anderson et al., 2018), efficient attention schemes (Lu et al., 2016; Yang et al., 2016), and advanced multimodal fusion techniques (Fukui et al., 2016; Yu et al., 2017); to deploy these state-of-the-art VQA models in real-world settings, they must be robust to the linguistic variations that originate from interactions with real users

  • We show that pre-trained language encoders make VQA models lexically robust

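The last highlight can be illustrated with a small, hypothetical probe that is not an experiment from the paper: encode a question and a hand-written rephrasing with an off-the-shelf pre-trained encoder (here bert-base-uncased via HuggingFace Transformers) and compare the cosine similarity of their mean-pooled representations. Representations that stay close under rephrasing are the kind of behaviour that helps a VQA model answer both forms consistently.

```python
# Hypothetical probe (not from the paper): how close does a pre-trained
# encoder map a question and one of its rephrasings?
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def sentence_vec(text):
    """Mean-pooled last-layer token states as a crude sentence representation."""
    with torch.no_grad():
        outputs = encoder(**tokenizer(text, return_tensors="pt"))
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

question = "What color is the man's shirt?"
rephrasing = "What is the color of the shirt the man is wearing?"
similarity = torch.cosine_similarity(sentence_vec(question), sentence_vec(rephrasing), dim=0)
print(f"cosine similarity between question and rephrasing: {similarity.item():.3f}")
```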

Summary

Introduction

Visual Question Answering (VQA) (Antol et al., 2015) is an image-conditioned question answering task that has gained immense popularity in the vision & language community. A majority of models obtained higher gains by introducing semantically rich visual features (Anderson et al., 2018), efficient attention schemes (Lu et al., 2016; Yang et al., 2016), and advanced multimodal fusion techniques (Fukui et al., 2016; Yu et al., 2017). To deploy these state-of-the-art VQA models in real-world settings, the models must be robust to the linguistic variations that originate from interactions with real users. BUTD (Bottom-Up and Top-Down attention; Anderson et al., 2018) is the base architecture for many other VQA architectures such as Pythia (Jiang et al., 2018) and BAN (Kim et al., 2018).

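For context on BUTD, the snippet below is a simplified sketch of a BUTD-style model rather than the authors' implementation: it assumes pre-extracted bottom-up region features as in Anderson et al. (2018), encodes the question with a GRU, applies top-down attention over the regions, fuses the two modalities with an elementwise product, and classifies over a fixed answer vocabulary; dimensions and layer choices are illustrative.

```python
# Simplified BUTD-style VQA model (illustrative sketch, not the authors' code).
import torch
import torch.nn as nn

class ButdStyleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, feat_dim=2048, hidden=1024, emb=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb, padding_idx=0)
        self.gru = nn.GRU(emb, hidden, batch_first=True)
        # Top-down attention: score each region given the question representation.
        self.att = nn.Sequential(nn.Linear(feat_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.q_proj = nn.Linear(hidden, hidden)
        self.v_proj = nn.Linear(feat_dim, hidden)
        self.classifier = nn.Sequential(nn.Linear(hidden, 2 * hidden), nn.ReLU(), nn.Linear(2 * hidden, num_answers))

    def forward(self, question_tokens, region_feats):
        # question_tokens: (B, T) token ids; region_feats: (B, K, feat_dim) pre-extracted regions.
        _, q = self.gru(self.embed(question_tokens))           # final hidden state (1, B, hidden)
        q = q.squeeze(0)                                       # (B, hidden)
        q_exp = q.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        att = torch.softmax(self.att(torch.cat([region_feats, q_exp], dim=-1)), dim=1)  # (B, K, 1)
        v = (att * region_feats).sum(dim=1)                    # attended visual feature (B, feat_dim)
        joint = self.q_proj(q) * self.v_proj(v)                # elementwise (Hadamard) fusion
        return self.classifier(joint)                          # answer logits (B, num_answers)
```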
