Multi-level Attention Networks for Visual Question Answering

Dongfei Yu, Tao Mei, Yong Rui, Jianlong Fu

doi:10.1109/cvpr.2017.446

Abstract

Inspired by the recent success of text-based question answering, visual question answering (VQA) has been proposed to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because reasoning in the visual domain requires both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To address these challenges, we propose a multi-level attention network for visual question answering that simultaneously reduces the semantic gap through semantic attention and supports fine-grained spatial inference through visual attention. First, we generate semantic concepts from the high-level semantics in a convolutional neural network (CNN) and select the question-related concepts as semantic attention. Second, we encode region-based middle-level CNN outputs into a spatially embedded representation with a bidirectional recurrent neural network, and further pinpoint the answer-related regions with a multilayer perceptron as visual attention. Third, we jointly optimize semantic attention, visual attention, and question embedding with a softmax classifier to infer the final answer. Extensive experiments show that the proposed approach outperforms state-of-the-art methods on two challenging VQA datasets.
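The three steps in the abstract can be illustrated with a toy sketch of question-guided soft attention. This is not the authors' implementation. The function names (`attend`, `matvec`, `rand_matrix`), the dimensions, the random weights, and the use of concatenation to fuse the three vectors are all assumptions made for illustration. The bidirectional recurrent encoding of regions is omitted, and the region vectors are simply taken as given.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(vec, mat):
    """Row vector (length d) times matrix (d x h), giving a list of length h."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]


def attend(items, question, W_item, W_q, w):
    """Score each item against the question with a one-hidden-layer MLP,
    normalize the scores with softmax, and return the weighted sum of items.
    (Hypothetical helper; the paper's exact scoring function may differ.)"""
    q_proj = matvec(question, W_q)
    scores = []
    for item in items:
        i_proj = matvec(item, W_item)
        hidden = [math.tanh(a + b) for a, b in zip(i_proj, q_proj)]
        scores.append(sum(hv * wv for hv, wv in zip(hidden, w)))
    weights = softmax(scores)
    dim = len(items[0])
    attended = [sum(wt * item[k] for wt, item in zip(weights, items))
                for k in range(dim)]
    return attended, weights


def rand_matrix(rng, rows, cols):
    """Random weights stand in for learned parameters."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_q, d_c, d_r, h = 8, 6, 10, 5          # assumed toy dimensions
question = [rng.gauss(0, 1) for _ in range(d_q)]
concepts = rand_matrix(rng, 4, d_c)     # 4 semantic concept vectors
regions = rand_matrix(rng, 9, d_r)      # 9 image region vectors

# Step 1: semantic attention over concepts.
sem, sem_w = attend(concepts, question,
                    rand_matrix(rng, d_c, h), rand_matrix(rng, d_q, h),
                    [rng.gauss(0, 1) for _ in range(h)])

# Step 2: visual attention over regions.
vis, vis_w = attend(regions, question,
                    rand_matrix(rng, d_r, h), rand_matrix(rng, d_q, h),
                    [rng.gauss(0, 1) for _ in range(h)])

# Step 3: fuse by concatenation (an assumption) and classify with softmax.
joint = question + sem + vis
probs = softmax(matvec(joint, rand_matrix(rng, len(joint), 3)))
```

In a trained model the weight matrices would be learned jointly from the answer loss, so that both attention distributions are shaped by what helps the final classifier.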

Similar Papers

Multi-source Multi-level Attention Networks for Visual Question Answering
Dongfei Yu ... Jianlong Fu
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 15
30 Apr 2019

Answer Questions with Right Image Regions: A Visual Attention Regularization Approach
Yibing Liu ... Jianhua Yin
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 18
04 Mar 2022

Static Correlative Filter Based Convolutional Neural Network for Visual Question Answering
Lijun Chen ... Qinyu Li
01 Jan 2018

Learning Convolutional Text Representations for Visual Question Answering
Zhengyang Wang ... Shuiwang Ji
07 May 2018
