Abstract

Collaborative reasoning for knowledge-based visual question answering is challenging but essential for understanding the features of images and questions. Previous methods either jointly fuse all kinds of features with an attention mechanism or use handcrafted rules to generate a layout for compositional reasoning; both approaches lack an explicit visual reasoning process and introduce a large number of parameters for predicting the correct answer. To conduct visual reasoning on arbitrary image–question pairs, in this paper we propose a novel reasoning model based on a question-guided tree structure with a knowledge base (QGTSKB) to address these problems. The model consists of four neural module networks: an attention model that locates attended regions from the image features and question embeddings via an attention mechanism; a gated reasoning model that forgets and updates the fused features; a fusion reasoning model that mines high-level semantics from the attended visual features and the knowledge base; and a knowledge-based fact model that compensates for missing visual and textual information with external knowledge. Our model therefore performs visual analysis and reasoning based on tree structures, a knowledge base, and the four neural module networks. Experimental results show that our model achieves superior performance over existing methods on the VQA v2.0 and CLEVR datasets, and visual reasoning experiments demonstrate the interpretability of the model.
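The composition described above can be pictured as a single tree node that chains the four modules. The following is a minimal sketch, assuming a PyTorch implementation evaluated bottom-up over the question-guided tree; all class, argument, and dimension names (QGTSKBNode, vis_dim, kb_fact, child_state, etc.) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn


class QGTSKBNode(nn.Module):
    """One tree node: attend over regions, fuse with a KB fact, then gate the child state."""

    def __init__(self, vis_dim, txt_dim, kb_dim, hid_dim):
        super().__init__()
        self.attention = nn.Linear(vis_dim + txt_dim, 1)   # attention model
        self.fusion = nn.Linear(vis_dim + kb_dim, hid_dim)  # fusion reasoning model
        self.gate = nn.GRUCell(hid_dim, hid_dim)            # gated reasoning model

    def forward(self, img_feats, word_emb, kb_fact, child_state):
        # img_feats: (B, R, vis_dim) region features; word_emb: (B, txt_dim) node word encoding
        # kb_fact: (B, kb_dim) retrieved external-knowledge embedding (knowledge-based fact model)
        # child_state: (B, hid_dim) hidden state passed up from the child node
        query = word_emb.unsqueeze(1).expand(-1, img_feats.size(1), -1)
        scores = self.attention(torch.cat([img_feats, query], dim=-1)).squeeze(-1)
        attn = torch.softmax(scores, dim=1)                            # (B, R) attention map
        attended = torch.bmm(attn.unsqueeze(1), img_feats).squeeze(1)  # attended visual evidence
        fused = torch.relu(self.fusion(torch.cat([attended, kb_fact], dim=-1)))
        return self.gate(fused, child_state)                           # forget/update via gating
```

In such a design, the root node's output state would be fed to an answer classifier, while intermediate nodes expose their attention maps for inspection, which is what makes the reasoning process interpretable.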

Highlights

  • Visual question answering (VQA) is a field at the intersection of computer vision and natural language processing that has emerged only in recent years

  • Image features extracted by a convolutional neural network (CNN) are fused with the encoded text features, and the fused features are fed to an artificial neural network

  • We propose a reasoning model based on a question-guided tree structure with a knowledge base (QGTSKB) to address these problems


Summary

Introduction

Visual question answering (VQA) is a field at the intersection of computer vision and natural language processing that has emerged only in recent years. In existing approaches, image features extracted by a convolutional neural network (CNN) are fused with encoded text features, and the fused features are fed to an artificial neural network; these methods turn visual question answering into a multi-label classification task. We propose a reasoning model based on a question-guided tree structure with a knowledge base (QGTSKB) to address these problems. The attention model uses the word encodings of the current tree node and fuses the attention map of the child node with the relationship between words from the knowledge base to extract local visual evidence for explicit reasoning. The attention map of each node serves as a qualitative result in the process of explicit visual reasoning, which further shows that our model is interpretable and adapts well to different tasks.
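To make the attention model's role concrete, the sketch below shows one way a node-level attention could fuse the current node's word encoding, a knowledge-base relation embedding, and the child node's attention map into a refined attention map over image regions. This is a hedged illustration under those assumptions; NodeAttention and its arguments are hypothetical names, not taken from the paper.

```python
import torch
import torch.nn as nn


class NodeAttention(nn.Module):
    """Re-attend over image regions at one tree node, conditioned on the child's map."""

    def __init__(self, vis_dim, txt_dim, rel_dim):
        super().__init__()
        self.score = nn.Linear(vis_dim + txt_dim + rel_dim + 1, 1)

    def forward(self, img_feats, word_emb, rel_emb, child_attn):
        # img_feats:  (B, R, vis_dim) region features
        # word_emb:   (B, txt_dim)    word encoding of the current tree node
        # rel_emb:    (B, rel_dim)    KB embedding of the relation between the node words
        # child_attn: (B, R)          attention map passed up from the child node
        R = img_feats.size(1)
        ctx = torch.cat([word_emb, rel_emb], dim=-1).unsqueeze(1).expand(-1, R, -1)
        x = torch.cat([img_feats, ctx, child_attn.unsqueeze(-1)], dim=-1)
        return torch.softmax(self.score(x).squeeze(-1), dim=1)  # refined (B, R) attention map
```

The returned map can then be visualized per node, which is the kind of qualitative evidence the visual reasoning experiments rely on.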

Visual Question Answering
Knowledge Base
Neural Module Network
Approach
Overview
Attention Model
Reasoning Model
Knowledge-Based Fact Model
Answer Prediction
Datasets
Implementation Details
Comparison with Existing Methods
Visual Reasoning
Findings
Conclusions