Abstract

Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common shortcoming of existing VQA approaches is that they consider only a very limited number of inter-modality interactions, which may not be enough to model the latent, complex image-question relations necessary for accurately answering questions. Moreover, most methods neglect the modeling of intra-modality interactions, which are also important to VQA. In this work, we propose a novel DenIII framework for modeling dense inter- and intra-modality interactions. It densely connects all pairwise layers of the network via the proposed Inter- and Intra-modality Attention Connectors, capturing fine-grained interplay across all hierarchical levels. The Inter-modality Attention Connector efficiently connects multi-modality features at any two layers with bidirectional attention, capturing inter-modality interactions, while the Intra-modality Attention Connector connects features of the same modality with unidirectional attention, modeling intra-modality interactions. Extensive ablation studies and visualizations validate the effectiveness of our method, and DenIII achieves state-of-the-art or competitive performance on three publicly available datasets.
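
The following is a minimal sketch, not the authors' released implementation: it illustrates how the two connectors described above could be realized with standard scaled dot-product attention in PyTorch. The class names, tensor shapes, head count, and residual connections are illustrative assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class InterModalityAttentionConnector(nn.Module):
    # Bidirectional attention between image and question features: each
    # modality at one layer attends to the other modality at another layer.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.img_attends_q = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q_attends_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, img, q):
        # img: (batch, num_regions, dim); q: (batch, num_tokens, dim)
        img_ctx, _ = self.img_attends_q(query=img, key=q, value=q)
        q_ctx, _ = self.q_attends_img(query=q, key=img, value=img)
        # Residual connections preserve the original features (an assumption).
        return img + img_ctx, q + q_ctx

class IntraModalityAttentionConnector(nn.Module):
    # Unidirectional attention within one modality: features at one layer
    # attend to features of the same modality at another layer.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, lower, higher):
        ctx, _ = self.attn(query=higher, key=lower, value=lower)
        return higher + ctx

# Dense connectivity would apply these connectors between every pair of
# layers (i, j) with i < j; below, a single inter-modality connection.
x_img, x_q = torch.randn(2, 36, 512), torch.randn(2, 14, 512)
inter = InterModalityAttentionConnector(512)
img_out, q_out = inter(x_img, x_q)
```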
