Multi-Modality Latent Interaction Network for Visual Question Answering

Gao Peng, Haoxuan You, Zhanpeng Zhang, Hongsheng Li, Xiaogang Wang

doi:10.1109/iccv.2019.00592

Abstract

Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual regions and words, which is not enough to answer the question correctly. From a human perspective, answering a visual question requires understanding summarizations of the visual and language information. In this paper, we propose the Multi-modality Latent Interaction module (MLI) to tackle this problem. The proposed module learns the cross-modality relationships between latent visual and language summarizations, which summarize the visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations. The cross-modality information between the latent summarizations is propagated to fuse valuable information from both modalities and is used to update the visual and word features. Such MLI modules can be stacked for several stages to model complex and latent relations between the two modalities, and they achieve highly competitive performance on the public VQA benchmarks VQA v2.0 and TDIUC. In addition, we show that the performance of our method can be significantly improved by combining it with the pre-trained language model BERT.
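
The summarize, interact, and update steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how summaries are formed or how information is propagated, so the softmax pooling, the element-wise product used to pair summaries, the dot-product attention in the update step, and every function and variable name below are assumptions chosen for illustration. Random matrices stand in for learned parameters, and plain Python lists are used to keep the example dependency-free.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply a (p x q) matrix by a (q x r) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def summarize(features, scores):
    """Pool n feature vectors into k latent summaries.

    `scores` is a k x n matrix (hypothetical; it would be learned in practice).
    Each row is normalized with softmax and used as pooling weights.
    """
    weights = [softmax(row) for row in scores]
    return matmul(weights, features)


def latent_pairs(vis_summ, lang_summ):
    """Pair every visual summary with every language summary (assumed:
    element-wise product), giving k * k cross-modality latent vectors."""
    return [[v * l for v, l in zip(vs, ls)] for vs in vis_summ for ls in lang_summ]


def update(features, latents):
    """Each feature attends over the latent pairs (assumed: scaled dot-product
    attention) and adds the attended vector back as a residual."""
    d = len(features[0])
    updated = []
    for f in features:
        logits = [sum(a * b for a, b in zip(f, z)) / math.sqrt(d) for z in latents]
        attn = softmax(logits)
        ctx = [sum(w * z[j] for w, z in zip(attn, latents)) for j in range(d)]
        updated.append([a + c for a, c in zip(f, ctx)])
    return updated


def mli_stage(regions, words, vis_scores, lang_scores):
    """One illustrative stage: summarize both modalities, pair the summaries,
    then update region and word features from the shared latent pairs."""
    vis_summ = summarize(regions, vis_scores)
    lang_summ = summarize(words, lang_scores)
    latents = latent_pairs(vis_summ, lang_summ)
    return update(regions, latents), update(words, latents)


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_regions, n_words, dim, k = 6, 4, 8, 2  # toy sizes, chosen arbitrarily
regions = rand_matrix(n_regions, dim, rng)
words = rand_matrix(n_words, dim, rng)
vis_scores = rand_matrix(k, n_regions, rng)
lang_scores = rand_matrix(k, n_words, rng)

# Stack two stages; the same score matrices are reused only for brevity.
for _ in range(2):
    regions, words = mli_stage(regions, words, vis_scores, lang_scores)

print(len(regions), len(regions[0]), len(words), len(words[0]))  # 6 8 4 8
```

With k summaries per modality, the interaction operates on k * k latent pairs rather than on every region-word pair, which is the cost and noise reduction the abstract argues for. Feature shapes are preserved by each stage, which is what allows stages to be stacked.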

Similar Papers

An Entropy Clustering Approach for Assessing Visual Question Difficulty
Kento Terao ... Toru Tamaki
IEEE Access | VOL. 8
01 Jan 2020

Improving Visual Question Answering by Referring to Generated Paragraph Captions
Hyounghun Kim ... Mohit Bansal
01 Jan 2019

Positional Attention Guided Transformer-Like Architecture for Visual Question Answering
Aihua Mao ... Jun Xuan
IEEE Transactions on Multimedia | VOL. 25
01 Jan 2023

Multi-stage reasoning on introspecting and revising bias for visual question answering
L An-An ... Liu Min
ACM Transactions on the Web | VOL. 18
08 Oct 2024
