Abstract

Recent advances in Natural Language Processing (NLP) and Computer Vision (CV) have prompted researchers to test the limits of the latest deep learning techniques on more complex AI tasks. One such task is Visual Question Answering (VQA), which spans several levels of complexity: some questions are simple and have obvious answers, while others require logical reasoning, common sense, and factual knowledge. As is common in scientific research, the field started simple and incorporated complexity gradually. Early datasets consisted of simple question-answer pairs with images depicting simple concepts, and relatively naive VQA models were trained on them. Over time, VQA datasets became more complicated, demanding greater cognitive capabilities from VQA models. This evolution pushed VQA models to come closer to matching human cognitive abilities by reasoning with common sense and factual knowledge. In this survey, we first discuss some of the well-known VQA datasets, then some of the crucial advancements in VQA architectures and the current work on integrating common sense and knowledge into these models. Moreover, reasoning is crucial for truly intelligent systems, but the representations in deep learning models are inherently fuzzy and vague. We need models that can transparently generate reasoning about their predictions, as classical expert systems operating on symbolic knowledge did, so we also discuss architectures that combine deep learning techniques with symbolic representations. Finally, we shed some light on the impact of transformers on deep learning and how transformer-based models are quickly becoming state of the art in almost every deep learning task.
