Abstract

Visual Question Answering (VQA) is a task that connects the fields of Computer Vision and Natural Language Processing. Given an image I and a natural language question Q about I, a VQA model must produce a coherent answer R (also in natural language) to Q. A particular type of visual question is the binary question, whose answer belongs to the set {yes, no}. Deep neural networks are currently the state-of-the-art technique for training VQA models. Despite their success, applying neural networks to the VQA task requires a very large amount of data to produce models with adequate precision. The datasets currently used to train VQA models are the result of laborious manual labeling by humans. This context motivates the study of approaches for augmenting these datasets in order to train more accurate prediction models. This paper describes a crowdsourcing tool that can be used collaboratively to augment an existing VQA dataset for binary questions. Our tool actively integrates candidate items from an external data source in order to optimize the selection of queries presented to curators.

