Abstract

Visual question answering (VQA) stands among the most researched problems at the intersection of computer vision, pattern recognition, and natural language processing. VQA extends the challenges of computer vision by requiring basic reasoning over visual scenes to answer questions about specific elements, actions, and relationships between objects in an image. Reasoning over images has long been a popular goal among computer vision and natural language processing researchers, and its quality depends directly on the expressivity of the representations learned from the datasets. In the past decade, with advances in computing hardware, neural networks, and the introduction of highly optimized and efficient software, a substantial body of research has addressed solving VQA efficiently. In this survey, we present an in-depth examination of representation learning in state-of-the-art VQA methods proposed in the literature and compare them to discuss future directions in the field.
