Abstract

Visual Question Answering (VQA) is a multi-disciplinary research problem that has captured the attention of both computer vision and natural language processing researchers. In VQA, a system is given an image and a natural-language question about that image as input, and it is required to produce a natural-language answer as output. A VQA algorithm may need common-sense reasoning over the information contained in the image, together with world knowledge, to produce the right answer. In this paper, we discuss some of the core concepts used in VQA systems and present a comprehensive survey of past efforts to address this problem. Beyond traditional VQA models, we also discuss visual question answering models that require reading text present in images and that are evaluated on recently developed datasets such as TextVQA, ST-VQA, and OCR-VQA. In addition to the standard datasets covered in previous surveys, we discuss several datasets introduced in 2019 and 2020, including GQA, OK-VQA, TextVQA, ST-VQA, and OCR-VQA. Newer evaluation metrics such as BLEU, MPT, METEOR, Average Normalized Levenshtein Similarity (ANLS), Validity, Plausibility, Distribution, Consistency, Grounding, and F1-score are explained alongside the metrics discussed in previous surveys. We conclude our survey with a discussion of open issues in each phase of the VQA task and present some promising future directions.
