Abstract

Visual Question Answering (VQA) is a task that spans two major fields of AI, Natural Language Processing and Computer Vision: given an image and a question posed in natural language, the goal is to produce the correct answer. The task is challenging because visual and linguistic processing must be combined to answer common-sense questions about a given image, and it requires reasoning over visual features and objects together with general knowledge to predict the correct answer. In this survey, we discuss state-of-the-art methodologies, algorithms, and datasets for VQA, along with a timeline of breakthroughs in the field. We also explore common techniques that combine convolutional neural networks with recurrent networks such as LSTMs or GRUs to map the question and the image into a common representation space. Each dataset contains questions at a distinct level of complexity, and different reasoning capabilities are required to handle complex images. We further cover recently released VQA datasets, the types of question patterns they contain, and the machine learning (ML) models applied to them. Finally, we discuss deep learning (DL) models that have shown excellent performance on various benchmark VQA datasets.
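
To make the CNN-plus-recurrent fusion idea concrete, the sketch below shows one common pattern from this family of models: a CNN encodes the image, an LSTM encodes the question, and the two feature vectors are combined by element-wise multiplication before a classifier over a fixed answer vocabulary. This is a minimal illustrative PyTorch sketch, not a specific model from the survey; the backbone choice, feature dimensions, vocabulary size, and answer-set size are assumed placeholders.

```python
# Minimal CNN + LSTM fusion model for VQA (illustrative sketch only).
# All hyperparameters below are placeholder assumptions, not values from the survey.
import torch
import torch.nn as nn
import torchvision.models as models

class SimpleVQAModel(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300,
                 hidden_dim=512, num_answers=1000):
        super().__init__()
        # Image encoder: a CNN backbone whose final classifier is replaced
        # by a projection into the shared embedding space.
        cnn = models.resnet18(weights=None)
        cnn.fc = nn.Linear(cnn.fc.in_features, hidden_dim)
        self.image_encoder = cnn
        # Question encoder: word embeddings fed to an LSTM (a GRU works the same way).
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Classifier over a fixed answer vocabulary, applied to the fused features.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, images, questions):
        # images: (B, 3, 224, 224); questions: (B, T) integer token ids
        img_feat = torch.tanh(self.image_encoder(images))   # (B, hidden_dim)
        _, (h_n, _) = self.lstm(self.word_embed(questions))
        q_feat = torch.tanh(h_n[-1])                         # (B, hidden_dim)
        fused = img_feat * q_feat  # element-wise fusion in the common space
        return self.classifier(fused)                        # answer logits

# Example forward pass with random inputs.
model = SimpleVQAModel()
images = torch.randn(2, 3, 224, 224)
questions = torch.randint(1, 10000, (2, 12))
logits = model(images, questions)  # shape: (2, 1000)
```

Element-wise multiplication is only one fusion choice; concatenation, bilinear pooling, and attention-based fusion are common alternatives discussed in the VQA literature.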
