Co-Attending Free-Form Regions and Detections With Multi-Modal Multiplicative Feature Embedding for Visual Question Answering

Pan Lu, Hongsheng Li, Xiaogang Wang, Wei Zhang, Jianyong Wang

doi:10.1609/aaai.v32i1.12240

Abstract

Visual Question Answering (VQA) has recently attracted growing attention in artificial intelligence. Existing VQA methods mainly rely on visual attention to associate the input question with the relevant image regions. Two attention mechanisms have been studied most: free-form region based attention, which attends to free-form image regions, and detection-based attention, which attends to pre-specified detection-box regions. We argue that the two mechanisms provide complementary information and should be integrated to better solve the VQA problem. In this paper, we propose a novel deep neural network for VQA that integrates both attention mechanisms. The proposed framework fuses features from free-form image regions, detection boxes, and question representations via a multi-modal multiplicative feature embedding scheme, jointly attending to question-related free-form image regions and detection boxes for more accurate question answering. The proposed method is extensively evaluated on two publicly available datasets, COCO-QA and VQA, and outperforms state-of-the-art approaches. Source code is available at https://github.com/lupantech/dual-mfa-vqa.
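
The general idea of question-guided multiplicative attention over two visual feature sets can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that region features, detection-box features, and the question vector have already been projected to a common dimension, and the names `attend`, `co_attend`, `w_region`, and `w_box` are hypothetical. Each branch scores its features by an element-wise product with the question, normalizes the scores with a softmax, and the two attended vectors are then fused with the question multiplicatively.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(features, question, weights):
    """Score each feature vector by a weighted sum of its element-wise
    product with the question, then return the attention weights and the
    attention-weighted average of the features."""
    scores = [
        sum(w * f * q for w, f, q in zip(weights, feat, question))
        for feat in features
    ]
    alphas = softmax(scores)
    attended = [
        sum(a * feat[d] for a, feat in zip(alphas, features))
        for d in range(len(question))
    ]
    return alphas, attended


def co_attend(region_feats, box_feats, question, w_region, w_box):
    """Attend over free-form regions and detection boxes separately
    (an assumed simplification), then fuse each attended vector with the
    question by element-wise product and concatenate the results."""
    region_alphas, region_vec = attend(region_feats, question, w_region)
    box_alphas, box_vec = attend(box_feats, question, w_box)
    fused = [r * q for r, q in zip(region_vec, question)]
    fused += [b * q for b, q in zip(box_vec, question)]
    return region_alphas, box_alphas, fused


if __name__ == "__main__":
    # Toy inputs: 2 free-form regions, 3 detection boxes, 2-d embeddings.
    regions = [[1.0, 0.0], [0.0, 1.0]]
    boxes = [[1.0, 1.0], [0.0, 0.0], [0.5, 0.5]]
    question = [1.0, 0.0]
    print(co_attend(regions, boxes, question, [1.0, 1.0], [1.0, 1.0]))
```

In a trained model the projections and scoring weights would be learned, and the fused vector would feed an answer classifier. Here they are fixed by hand purely to show the data flow.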

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Apr 27, 2018
Citations: 43

Similar Papers

ALSA: Adversarial Learning of Supervised Attentions for Visual Question Answering.
Yun Liu ... Bo Zhang
IEEE Transactions on Cybernetics | VOL. 52
11 Nov 2020

Local relation network with multilevel attention for visual question answering
Bo Sun ... Lejun Yu
Journal of Visual Communication and Image Representation | VOL. 73
20 Jan 2020

Enhancing visual question answering with a two‐way co‐attention mechanism and integrated multimodal features
Mayank Agrawal ... Anand Singh Jalal
Computational Intelligence | VOL. 40
21 Dec 2023

Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering.
Zihan Guo ... Dezhi Han
Sensors (Basel, Switzerland) | VOL. 20
26 Nov 2020
