Flexible Sentence Analysis Model for Visual Question Answering Network

Wei Deng, Guanghao Jin, Shengbei Wang, Jianming Wang

doi:10.1145/3278198.3278207

Abstract

Visual question answering (VQA) has attracted much attention in both computer vision and natural language processing. Generally, a VQA system adopts a sentence analysis model that decomposes the question into short parts, analyzes the user's intent, and merges the partial results to obtain the final answer. Despite the success of these models, correctly analyzing long questions remains a key problem in VQA. In particular, when a sentence can be comprehended differently because of the questioner's situation or customs, the sentence analysis model may output a wrong answer, leading to a severe performance drop in the VQA system. To tackle this problem, this paper proposes a new sentence comprehension model, named the flexible sentence analysis model, which is mainly used to handle sentences related to object counting. In human dialogue, when a first answer turns out to be wrong, people change the way they comprehend the sentence to find the correct answer. Inspired by this mechanism, the flexible sentence analysis model tries a different way of comprehending the sentence after the sentence has been given a wrong numerical answer, and the VQA system then generates a new answer from the new output. Our model was tested on the CLEVR dataset, and the experimental results show that our method improves accuracy by nearly 10.5% on long sentences. This shows that our network performs better in both correctness and robustness.

Full Text