Visual question answering (VQA) research has garnered increasing attention in recent years. It is considered a visual Turing test because it requires a computer to respond to textual questions based on an image. Solving VQA requires expertise in computer vision, natural language processing, knowledge understanding, and reasoning. Most VQA techniques train models to map a combination of image and question features to the expected answer; the methods chosen for extracting image and question features, and for combining them, vary from model to model. Teaching a model question–answer patterns in this way is ineffective for queries that involve counting and reasoning, and it requires considerable resources and large datasets for training. General-purpose VQA datasets offer only a restricted number of items as answers to counting questions ([Formula: see text]), and the distribution of the answers is not uniform. To investigate these issues, we created synthetic datasets in which the number of objects in the image and the amount of occlusion can be adjusted. Specifically, we devised a zero-shot learning VQA system for counting-related questions that produces answers by analyzing the output of an object detector together with the query keywords. On the synthetic datasets, our model generated 100% correct results. Testing on the benchmark datasets Task Directed Image Understanding Challenge (TDIUC) and TallyQA-simple indicated that the proposed model matched the performance of the learning-based baseline models. This methodology can be used efficiently for counting VQA questions confined to certain domains, even when the number of items to be counted is large.
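The abstract describes answering counting questions by combining object-detector output with the question's keywords rather than learning question–answer patterns. The sketch below illustrates that idea under simple assumptions; the function and class names (`Detection`, `extract_target_class`, `answer_counting_question`), the keyword-matching heuristic, and the confidence threshold are illustrative and not taken from the paper.

```python
# Minimal sketch of zero-shot counting from object-detector output.
# All names and heuristics here are hypothetical, not the paper's actual interface.
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Detection:
    label: str    # class name predicted by the object detector
    score: float  # detection confidence in [0, 1]


def extract_target_class(question: str, known_classes: set[str]) -> str | None:
    """Pick the detector class mentioned in a counting question,
    e.g. 'How many cars are in the image?' -> 'car' (simple keyword matching)."""
    tokens = [t.strip("?.,!").lower() for t in question.split()]
    for token in tokens:
        singular = token[:-1] if token.endswith("s") else token  # crude de-pluralization
        if singular in known_classes:
            return singular
    return None


def answer_counting_question(question: str,
                             detections: list[Detection],
                             known_classes: set[str],
                             score_threshold: float = 0.5) -> int:
    """Count detections whose class matches the question keyword."""
    target = extract_target_class(question, known_classes)
    if target is None:
        return 0
    return sum(1 for d in detections
               if d.label.lower() == target and d.score >= score_threshold)


# Usage example with made-up detections:
dets = [Detection("car", 0.91), Detection("car", 0.76), Detection("person", 0.88)]
print(answer_counting_question("How many cars are there?", dets, {"car", "person"}))  # -> 2
```

Because no question–answer mapping is learned, the counting range is limited only by the detector, which is consistent with the claim that the approach scales to large object counts within a fixed domain.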