Abstract

Paragraph-style image captions describe diverse aspects of an image, in contrast to the more common single-sentence captions that only provide an abstract description of the image. These paragraph captions can hence contain substantial information about the image for tasks such as visual question answering. Moreover, this textual information is complementary to the visual information present in the image, because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and scenes that can be directly matched with the textual question and copied into the textual answer (i.e., via an easier modality match). Hence, we propose a combined Visual and Textual Question Answering (VTQA) model that takes as input a paragraph caption as well as the corresponding image, and answers the given question based on both inputs. In our model, the inputs are fused to extract related information via cross-attention (early fusion), then fused again in the form of a consensus (late fusion), and finally expected answers are given an extra score to increase their chance of selection (later fusion). Empirical results show that paragraph captions, even when automatically generated (via an RL-based encoder-decoder model), help answer more visual questions correctly. Overall, our joint model, when trained on the Visual Genome dataset, significantly improves VQA performance over a strong baseline model.
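
A minimal sketch of the three fusion levels described above, written in PyTorch under assumptions of our own (the module names, dimensions, and simple dot-product attention are illustrative, not the authors' exact architecture): the question cross-attends to image-region features and to paragraph-caption token features (early fusion), the two resulting answer scores are summed as a consensus (late fusion), and answer candidates found in the caption receive an extra score (later fusion).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VTQASketch(nn.Module):
        """Illustrative three-level fusion for visual + textual QA (not the authors' code)."""
        def __init__(self, dim=512, num_answers=3000):
            super().__init__()
            self.q_proj = nn.Linear(dim, dim)
            self.v_proj = nn.Linear(dim, dim)
            self.t_proj = nn.Linear(dim, dim)
            self.visual_head = nn.Linear(dim, num_answers)
            self.text_head = nn.Linear(dim, num_answers)

        def cross_attend(self, query, context):
            # Early fusion: question-guided attention over context features.
            scores = torch.matmul(query, context.transpose(1, 2)) / context.size(-1) ** 0.5
            weights = F.softmax(scores, dim=-1)
            return torch.matmul(weights, context)

        def forward(self, q_emb, v_feats, t_feats, answer_in_caption):
            # q_emb: (B, 1, D) pooled question; v_feats: (B, Nv, D) image-region features;
            # t_feats: (B, Nt, D) paragraph-caption token features;
            # answer_in_caption: (B, num_answers) 0/1 mask of candidates present in the caption.
            q = self.q_proj(q_emb)
            v_ctx = self.cross_attend(q, self.v_proj(v_feats)).squeeze(1)  # visual branch
            t_ctx = self.cross_attend(q, self.t_proj(t_feats)).squeeze(1)  # textual branch
            visual_logits = self.visual_head(v_ctx * q.squeeze(1))
            text_logits = self.text_head(t_ctx * q.squeeze(1))
            logits = visual_logits + text_logits   # late fusion: consensus of both branches
            return logits + answer_in_caption      # later fusion: boost expected answers

    # Example with random features: output has shape (batch, num_answers).
    model = VTQASketch()
    q = torch.randn(2, 1, 512)
    v = torch.randn(2, 36, 512)
    t = torch.randn(2, 60, 512)
    mask = torch.zeros(2, 3000)
    out = model(q, v, t, mask)  # torch.Size([2, 3000])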

Highlights

  • Understanding visual information along with natural language has been studied in different ways

  • In visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017; Lu et al., 2016; Fukui et al., 2016; Xu and Saenko, 2016; Yang et al., 2016; Zhu et al., 2016; Anderson et al., 2018), models are trained to choose the correct answer given a question about an image

  • Similar to the VQA task, image captioning models should learn the relationship between partial areas in an image and the generated words or phrases

Summary

Introduction

Understanding visual information along with natural language has been studied in different ways. Similar to the VQA task, image captioning models should learn the relationship between partial areas in an image and the generated words or phrases. While these two tasks seem to go in different directions, they share the same purpose: understanding visual information with language. Paragraph-style descriptive captions can explain more explicitly (via intermediate symbolic representations) what objects are in the image and their relationships, and VQA questions can then be answered more directly by matching this textual information with the questions. Hence, our model takes these paragraph captions and attribute sentences as input, in addition to the standard input image features.
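
As a toy illustration of that modality match (an assumed example, not taken from the paper), answer candidates that literally appear in a generated paragraph caption can be detected with simple string matching, which is the kind of signal the extra answer score described in the abstract could exploit:

    # Hypothetical caption, question, and candidate answers, for illustration only.
    paragraph = ("A man in a red shirt is riding a brown horse on a beach. "
                 "The sky is blue and there are waves behind him.")
    question = "What color is the horse?"
    candidates = ["brown", "black", "two", "beach"]

    # Candidates found among the caption's tokens receive an extra score.
    caption_tokens = set(paragraph.lower().replace(".", "").split())
    boost = {c: 1.0 if c in caption_tokens else 0.0 for c in candidates}
    print(boost)  # {'brown': 1.0, 'black': 0.0, 'two': 0.0, 'beach': 1.0}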

Related Work
Paragraph Captioning Model
Features
Three Fusion Levels
VQA baseline
TextQA with GenP
Conclusion
Findings
Appendix A: Attention Visualization