Abstract

Visual Question Answering (VQA) is a multimodal research task at the intersection of Computer Vision (CV) and Natural Language Processing (NLP). The core of the VQA task is to extract useful information from both the image and the question and to produce an accurate answer. This paper presents a VQA model based on multimodal encoders and decoders with gate attention (MEDGA). Each encoder and decoder block in MEDGA applies not only self-attention and cross-modal attention but also gate attention, so that the model can attend to inter-modal and intra-modal interactions simultaneously within the visual and language modalities. In addition, the gate attention filters out noise irrelevant to the answer and outputs attended features closely related to both the visual and language inputs, which makes answer prediction more accurate. Experimental evaluations on the VQA 2.0 dataset, together with ablation experiments under different conditions, demonstrate the effectiveness of MEDGA. MEDGA reaches 70.11% accuracy on the test-std split, exceeding many existing methods.
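To illustrate the idea of combining self-attention, cross-modal attention, and a gate, the following is a minimal PyTorch sketch of one decoder-style block. It is an assumption-laden illustration, not the paper's implementation: the layer sizes, the sigmoid-gate formulation over concatenated attention outputs, and the class name GateAttentionBlock are all hypothetical.

```python
import torch
import torch.nn as nn


class GateAttentionBlock(nn.Module):
    """Hypothetical sketch of one MEDGA-style block: self-attention within
    the query modality, cross-modal attention to the other modality, and a
    sigmoid gate that suppresses attention outputs irrelevant to the answer."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Assumed gate form: element-wise sigmoid gate computed from the
        # concatenated self- and cross-attention results.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Intra-modal interaction: self-attention within the query modality.
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Inter-modal interaction: attend from x (e.g. question words)
        # to the other modality (e.g. image regions).
        c = self.cross_attn(x, context, context, need_weights=False)[0]
        # Gate attention: filter out features irrelevant to the answer.
        g = self.gate(torch.cat([x, c], dim=-1))
        x = self.norm2(x + g * c)
        return self.norm3(x + self.ffn(x))


# Usage: 14 question tokens attending to 36 image-region features.
block = GateAttentionBlock()
q = torch.randn(2, 14, 512)   # language features
v = torch.randn(2, 36, 512)   # visual features
out = block(q, v)             # shape: (2, 14, 512)
```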
