Abstract

Visual question answering (VQA) is an integrative research problem in artificial intelligence. A VQA system takes an image and a textual query as input and tries to find the correct answer by combining the image with deductions drawn from the query. Interpreting visual reasoning queries and retrieving accurate answers is essential. Recent studies have constructed parse trees over the input queries, but they perform poorly because they lack semantic interpretation. This work aims to achieve comprehensive reasoning by attaching a semantic representation to the constructed parse tree. The proposed model, the semantic tree-based visual question answering system (STVQA), captures the inherent visual evidence of every word parsed from the textual query and combines it with the visual evidence of its child nodes; the combined result is passed up to the parent node in the parse tree. In this way, STVQA performs global reasoning over the image and the textual query. VQA systems apply to domains such as image retrieval and surveillance, and can serve as an aid for visually impaired people. STVQA is evaluated on CLEVR, a publicly available and challenging benchmark dataset. The model is shown to be computationally efficient and data-efficient, achieving a new state-of-the-art accuracy of 90%.

Keywords: Natural language processing, Semantic, Parse tree
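To make the bottom-up evidence propagation concrete, the following is a minimal sketch of the idea described above: each word in the parse tree attends over image regions, combines its own evidence with that of its children, and passes the result to its parent. All names here (TreeNode, word_evidence, propagate) and the mean-based fusion are illustrative assumptions, not the paper's actual architecture, which would use learned attention and fusion modules.

```python
# Sketch of bottom-up visual-evidence propagation over a parse tree.
# Hypothetical names and toy computations; not the paper's implementation.
import numpy as np

class TreeNode:
    def __init__(self, word, children=None):
        self.word = word
        self.children = children or []

def word_evidence(word, image_feats):
    """Toy per-word attention over image regions (stand-in for a learned module)."""
    rng = np.random.default_rng(abs(hash(word)) % (2**32))
    scores = rng.random(len(image_feats))
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over regions
    return weights @ image_feats                      # weighted sum of region features

def propagate(node, image_feats):
    """Combine a node's own evidence with its children's, bottom-up to the root."""
    child_ev = [propagate(c, image_feats) for c in node.children]
    own = word_evidence(node.word, image_feats)
    # Simple mean fusion as a placeholder; the paper would learn this combination.
    return np.mean([own] + child_ev, axis=0) if child_ev else own

# Usage: parse tree for "red cube left of sphere"; 4 image regions, 8-dim features.
tree = TreeNode("left", [TreeNode("cube", [TreeNode("red")]), TreeNode("sphere")])
image_feats = np.random.default_rng(0).random((4, 8))
root_evidence = propagate(tree, image_feats)   # global evidence at the root
print(root_evidence.shape)                     # (8,)
```

The root node thus aggregates evidence from the entire query, which is what enables the global reasoning the abstract refers to.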
