A Picture May Be Worth a Hundred Words for Visual Question Answering

Yusuke Hirota, Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima

doi:10.3390/electronics13214290

Abstract

How far can textual representations go in understanding images? In image understanding, effective representations are essential. Deep visual features from object recognition models currently dominate various tasks, especially Visual Question Answering (VQA). However, these conventional features often struggle to capture image details in ways that match human understanding, and their decision processes lack interpretability. Meanwhile, recent progress in language models suggests that descriptive text could offer a viable alternative. This paper investigates the use of descriptive text as an alternative to deep visual features in VQA. We propose to process description–question pairs rather than visual features, using a language-only Transformer model. We also explore data augmentation strategies to enhance training set diversity and mitigate statistical bias. Extensive evaluation shows that textual representations of approximately a hundred words can compete effectively with deep visual features on both the VQA 2.0 and VQA-CP v2 datasets. Our qualitative experiments further reveal that these textual representations enable clearer investigation of VQA model decision processes, thereby improving interpretability.
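As a rough illustration of what a description–question pair might look like as input to a language-only model, the sketch below joins a question and an image description into a single text sequence and caps the description at about a hundred words. The function name `build_text_input`, the `[CLS]`/`[SEP]` layout, and the exact 100-word cap are assumptions made for this example; the abstract does not specify the input format the authors actually use.

```python
def build_text_input(question: str, description: str,
                     max_description_words: int = 100) -> str:
    """Join a question and an image description into one text sequence.

    The [CLS]/[SEP] layout and the word cap are illustrative assumptions,
    not the paper's confirmed input format.
    """
    # Keep only the first `max_description_words` whitespace-separated words.
    words = description.split()[:max_description_words]
    return f"[CLS] {question.strip()} [SEP] {' '.join(words)} [SEP]"


# Hypothetical question and description, invented for this example.
example = build_text_input(
    "What color is the umbrella?",
    "A woman holds a red umbrella while walking a small dog on a wet street.",
)
print(example)
```

Because the model sees only this text, each answer can be traced back to specific words in the description, which is the kind of inspection the abstract refers to as improved interpretability.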

Journal: Electronics
Publication Date: Oct 31, 2024
License type: CC BY 4.0

Similar Papers

Visual Question Answering with Textual Representations for Images
Yusuke Hirota ... Noa Garcia
01 Oct 2021

Learning Visual Knowledge Memory Networks for Visual Question Answering
Zhou Su ... Dongqi Cai
01 Jun 2018

Learning Convolutional Text Representations for Visual Question Answering
Zhengyang Wang ... Shuiwang Ji
07 May 2018

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering
Yash Goyal ... Douglas Summers-Stay
International Journal of Computer Vision | VOL. 127
11 Sep 2018
