Abstract
Visual question answering (VQA) is a multi-modal task involving natural language processing (NLP) and computer vision (CV). It requires models to understand both visual and textual information simultaneously in order to predict the correct answer for an input image and question, and it has been widely applied in intelligent transport systems, smart cities, and other fields. Today, advanced VQA approaches achieve better accuracy by designing co-attention mechanisms that model dense interactions between image regions and question words. However, modeling interactions between every image region and every question word forces the model to compute irrelevant information and distracts its attention. To address this problem, we propose a novel model called Multi-modal Explicit Sparse Attention Networks (MESAN), which concentrates the model's attention by explicitly selecting the parts of the input features that are most relevant to answering the input question. We argue that this top-k selection strategy reduces the interference caused by irrelevant information and ultimately helps the model achieve better performance. Experimental results on the benchmark dataset VQA v2 demonstrate the effectiveness of our model: our best single model achieves 70.71% and 71.08% overall accuracy on the test-dev and test-std sets, respectively. In addition, attention visualizations show that our model obtains better attended features than other advanced models. Our work demonstrates that models with sparse attention mechanisms can also achieve competitive results on VQA datasets. We hope that it promotes the development of VQA models and the application of VQA-related artificial intelligence (AI) technology in various domains.
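To make the top-k selection idea concrete, the sketch below shows a generic sparse scaled dot-product attention step in PyTorch, where each query keeps only its k highest-scoring keys and the rest are masked out before the softmax. This is a minimal illustration of the general technique under assumed tensor shapes and names, not the exact MESAN implementation.

```python
# Minimal sketch of top-k sparse (scaled dot-product) attention.
# Shapes, the hyperparameter name `k`, and the function name are
# illustrative assumptions, not the authors' exact code.
import torch
import torch.nn.functional as F


def topk_sparse_attention(query, key, value, k):
    """Attend only to the k highest-scoring keys for each query.

    query: (batch, n_q, d)   e.g. question-word features
    key:   (batch, n_k, d)   e.g. image-region features
    value: (batch, n_k, d)
    k:     number of keys each query is allowed to attend to
    """
    d = query.size(-1)
    # Raw attention scores, as in standard scaled dot-product attention.
    scores = torch.matmul(query, key.transpose(-2, -1)) / d ** 0.5  # (batch, n_q, n_k)

    # Keep only the top-k scores per query; mask the rest with -inf so the
    # softmax assigns them (numerically) zero weight.
    topk_scores, _ = scores.topk(k, dim=-1)        # (batch, n_q, k), sorted descending
    threshold = topk_scores[..., -1, None]         # k-th largest score per query
    masked = scores.masked_fill(scores < threshold, float('-inf'))

    weights = F.softmax(masked, dim=-1)            # sparse attention map
    return torch.matmul(weights, value)            # (batch, n_q, d)


# Toy usage: 14 question words each attending to the 8 most relevant of 36 regions.
q = torch.randn(2, 14, 512)
kv = torch.randn(2, 36, 512)
out = topk_sparse_attention(q, kv, kv, k=8)
print(out.shape)  # torch.Size([2, 14, 512])
```

Masking below the k-th largest score before the softmax is one standard way to realize explicit sparse attention; entries outside the top k receive zero weight, so irrelevant image regions or question words no longer dilute the attention distribution.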
Highlights
Multi-modal learning tasks such as image captioning [1,2], image-text matching [3,4,5], and visual question answering (VQA) [6], which involve natural language processing and computer vision, have attracted considerable attention from researchers in both fields
Many existing co-attention based VQA methods model dense interactions between each image region and each question word, which forces the models to compute irrelevant information and harms their performance; Multi-modal Explicit Sparse Attention Networks (MESAN) reduce this interference from irrelevant information and focus the model's attention through explicit top-k selection
Like other sparse attention mechanisms used in the natural language processing (NLP) and computer vision (CV) fields, our models achieve competitive results
Summary
Multi-modal learning tasks such as image captioning [1,2], image-text matching [3,4,5], and visual question answering (VQA) [6], which involve natural language processing and computer vision, have attracted considerable attention from researchers in both fields. Compared with other multi-modal learning tasks, VQA is more difficult, since it requires the model to understand visual information, textual information, and the relationships between them simultaneously, and it may require complex reasoning and commonsense knowledge to answer the questions correctly. An instance of a VQA dataset consists of a visual image and a textual question about the content of the image, and the model must predict the correct answer for this image-question pair. The perfect combination of sensors and deep learning gives machines senses such as vision, hearing, and smell, which makes high-value sensor data analysis and low-cost, real-time intelligent sensor systems possible.
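As a rough illustration of this setting, the hypothetical sketch below wires pre-extracted image-region features and an embedded question into a cross-attention layer and predicts the answer as a classification over a fixed answer set. All module choices, names, and sizes are assumptions for illustration and do not reproduce the architecture proposed in the paper.

```python
# Hypothetical, minimal VQA model: question words attend to image regions,
# the fused representation is pooled, and an answer is classified.
import torch
import torch.nn as nn


class SimpleVQAModel(nn.Module):
    def __init__(self, vocab_size, num_answers, d_model=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d_model)        # question word embeddings
        self.region_proj = nn.Linear(2048, d_model)                # e.g. pre-extracted region features
        self.question_enc = nn.LSTM(d_model, d_model, batch_first=True)
        # Cross-attention: question as query, image regions as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, region_feats, question_tokens):
        # region_feats: (batch, num_regions, 2048); question_tokens: (batch, n_words)
        v = self.region_proj(region_feats)
        q, _ = self.question_enc(self.word_embed(question_tokens))
        attended, _ = self.cross_attn(q, v, v)                     # question attends to regions
        fused = attended.mean(dim=1)                               # pool over question words
        return self.classifier(fused)                              # answer logits


# Toy usage: 36 image regions, a 14-word question, 3000 candidate answers.
model = SimpleVQAModel(vocab_size=20000, num_answers=3000)
logits = model(torch.randn(2, 36, 2048), torch.randint(0, 20000, (2, 14)))
print(logits.shape)  # torch.Size([2, 3000])
```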