Abstract

Visual question answering (VQA) is a multi-modal task involving natural language processing (NLP) and computer vision (CV). It requires models to understand both visual and textual information simultaneously in order to predict the correct answer to a textual question about an input image, and it has been widely applied in intelligent transport systems, smart cities, and other fields. Today, advanced VQA approaches model dense interactions between image regions and question words through co-attention mechanisms to achieve better accuracy. However, modeling interactions between every image region and every question word forces the model to compute irrelevant information and distracts its attention. To solve this problem, we propose a novel model called Multi-modal Explicit Sparse Attention Networks (MESAN), which concentrates the model's attention by explicitly selecting the parts of the input features that are most relevant to answering the input question. We argue that this top-k selection reduces the interference caused by irrelevant information and ultimately helps the model achieve better performance. Experimental results on the benchmark dataset VQA v2 demonstrate the effectiveness of our model: our best single model delivers 70.71% and 71.08% overall accuracy on the test-dev and test-std sets, respectively. In addition, attention visualizations show that our model obtains better attended features than other advanced models. Our work proves that models with sparse attention mechanisms can achieve competitive results on VQA datasets, and we hope it will promote the development of VQA models and the application of VQA-related artificial intelligence (AI) technology in various fields.
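The core idea behind explicit sparse attention is simple: compute ordinary attention scores, keep only the k largest scores for each query, and mask the rest before the softmax so that irrelevant positions receive exactly zero weight. The sketch below illustrates this top-k masking in PyTorch; the function name, tensor shapes, and the value of top_k are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of explicit sparse attention via top-k selection
# (hypothetical illustration, not the authors' code). Scores outside the
# k largest per query row are set to -inf before the softmax, so attention
# weights for irrelevant key positions become exactly zero.
import torch
import torch.nn.functional as F

def sparse_topk_attention(q, k, v, top_k=8):
    """q: (batch, n_q, d); k, v: (batch, n_k, d)."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5      # (batch, n_q, n_k)

    # k-th largest score in each row acts as the threshold.
    kth = scores.topk(min(top_k, scores.size(-1)), dim=-1).values[..., -1:]
    masked = scores.masked_fill(scores < kth, float("-inf"))

    attn = F.softmax(masked, dim=-1)      # sparse attention weights
    return torch.matmul(attn, v)          # attended features

# Example: 36 image-region features attended by 14 question-word queries.
q = torch.randn(2, 14, 512)
kv = torch.randn(2, 36, 512)
out = sparse_topk_attention(q, kv, kv, top_k=8)   # (2, 14, 512)
```

In this sketch, the dense scaled dot-product scores are still computed in full; sparsity is imposed only on the softmax input, which is how top-k selection concentrates attention without changing the overall co-attention architecture.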

Highlights

  • Multi-modal learning tasks such as image captioning [1,2], image-text matching [3,4,5], and visual question answering (VQA) [6], which involve natural language processing and computer vision, have attracted considerable attention from researchers in these two fields

  • Many existing co-attention-based VQA methods model dense interactions between each image region and each question word, which forces the models to compute irrelevant information and harms their performance; Multi-modal Explicit Sparse Attention Networks (MESAN) reduce this interference and focus the model's attention through explicit top-k selection

  • Like other sparse attention mechanisms used in the natural language processing (NLP) and computer vision (CV) fields, our model achieves competitive results


Summary

Introduction

Multi-modal learning tasks such as image captioning [1,2], image-text matching [3,4,5], and visual question answering (VQA) [6], which involve natural language processing and computer vision, have attracted considerable attention from researchers in these two fields. Compared with other multi-modal learning tasks, VQA is more difficult, since it requires the model to understand visual information, textual information, and the relationships between them simultaneously, and it may require complex reasoning and commonsense knowledge to answer questions correctly. A single instance of a VQA dataset contains a visual image and a textual question related to the content of the image, and the model must predict the correct answer for that image-question pair. The combination of sensors and deep learning gives machines senses such as vision, hearing, and smell, which makes high-value sensor data analysis and low-cost, real-time intelligent sensor systems possible.
