Abstract

Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. A VQA model combines visual and textual features in order to answer questions grounded in an image. Current works in VQA focus on questions which are answerable by direct analysis of the question and image alone. We present a concept-aware algorithm, ConceptBert, for questions which require common sense or basic factual knowledge from external structured content. Given an image and a question in natural language, ConceptBert requires visual elements of the image and a Knowledge Graph (KG) to infer the correct answer. We introduce a multi-modal representation which learns a joint Concept-Vision-Language embedding inspired by the popular BERT architecture. We exploit the ConceptNet KG to encode common-sense knowledge and evaluate our methodology on the Outside Knowledge-VQA (OK-VQA) and VQA datasets.
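
To make the fused representation concrete, the sketch below shows one way a joint Concept-Vision-Language embedding could be assembled: visual region features, question token embeddings, and ConceptNet concept vectors are projected into a shared space and passed through a BERT-style self-attention encoder before answer classification. All dimensions, module names, and the fusion strategy are illustrative assumptions, not the exact ConceptBert architecture.

```python
# Hypothetical sketch of a Concept-Vision-Language fusion module.
# Dimensions, module names, and the fusion strategy are assumptions
# for illustration, not the published ConceptBert implementation.
import torch
import torch.nn as nn


class ConceptVisionLanguageFusion(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=768, kg_dim=300,
                 hidden=768, num_answers=3129, num_layers=2, num_heads=8):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.vis_proj = nn.Linear(vis_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.kg_proj = nn.Linear(kg_dim, hidden)
        # BERT-style self-attention over the concatenated token sequence.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, vis_feats, txt_feats, kg_feats):
        # vis_feats: (B, n_regions, vis_dim)  detected-object features
        # txt_feats: (B, n_tokens, txt_dim)   question token embeddings
        # kg_feats:  (B, n_concepts, kg_dim)  ConceptNet concept embeddings
        tokens = torch.cat([self.vis_proj(vis_feats),
                            self.txt_proj(txt_feats),
                            self.kg_proj(kg_feats)], dim=1)
        fused = self.encoder(tokens)
        # Mean-pool the fused sequence and score the answer vocabulary.
        return self.classifier(fused.mean(dim=1))


# Example with random tensors standing in for real features.
model = ConceptVisionLanguageFusion()
logits = model(torch.randn(2, 36, 2048),   # 36 detected image regions
               torch.randn(2, 20, 768),    # 20 question tokens
               torch.randn(2, 5, 300))     # 5 retrieved concepts
print(logits.shape)  # torch.Size([2, 3129])
```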

Highlights

  • ConceptBert is a concept-aware VQA model that fuses visual features, question embeddings, and ConceptNet Knowledge Graph embeddings in a joint Concept-Vision-Language representation inspired by BERT, targeting questions that require common sense or basic factual knowledge

  • Adding the Knowledge Graph (KG) embeddings to the model leads to gains of 11.56% and 7.19% on the VQA and Outside Knowledge-VQA (OK-VQA) datasets, respectively

  • Since we report our results on the validation set, we removed it from the training phase so that the model relies only on the training set

Summary

Introduction

Visual Question Answering (VQA) was first introduced to bridge the gap between natural language processing and image understanding applications in the joint space of vision and language (Malinowski and Fritz, 2014). Most VQA benchmarks compute a question representation using word embedding techniques and Recurrent Neural Networks (RNNs), together with a set of object descriptors comprising bounding box coordinates and image feature vectors. Word and image representations are fused and fed to a network to train a VQA model. These approaches are practical when no knowledge beyond the visual content is required. Incorporating external knowledge introduces several advantages. External knowledge and supporting facts can improve the relational representation between the objects detected in the image, or between entities in the question and objects in the image. They also provide information on how the answer can be derived from the question, and the complexity of the questions can be increased based on the supporting knowledge base.
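
As a reference point for the pipeline described above, here is a minimal, generic VQA baseline: a GRU question encoder over word embeddings, pooled image features, elementwise-product fusion, and an answer classifier. Layer sizes, the vocabulary size, and the answer-set size are placeholder assumptions, and this is a sketch of the conventional approach rather than the model proposed in the paper.

```python
# Minimal sketch of the conventional VQA pipeline: an RNN question encoder
# fused with image features and fed to a classifier. All sizes and the
# elementwise-product fusion are illustrative choices.
import torch
import torch.nn as nn


class BaselineVQA(nn.Module):
    def __init__(self, vocab_size=15000, embed_dim=300, hidden=1024,
                 img_dim=2048, num_answers=3129):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden)
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers))

    def forward(self, question_ids, img_feats):
        # question_ids: (B, n_tokens) token indices for the question
        # img_feats:    (B, img_dim)  pooled CNN/detector image features
        _, h = self.rnn(self.embed(question_ids))  # h: (1, B, hidden)
        q = h.squeeze(0)
        v = torch.relu(self.img_proj(img_feats))
        fused = q * v                              # elementwise fusion
        return self.classifier(fused)


# Example forward pass with random inputs.
model = BaselineVQA()
logits = model(torch.randint(0, 15000, (2, 14)), torch.randn(2, 2048))
print(logits.shape)  # torch.Size([2, 3129])
```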
