Abstract

Traditional techniques for visual question answering (VQA) are mostly end-to-end neural network based, which often perform poorly (e.g., low accuracy) due to lack of understanding and reasoning. To overcome the weaknesses, we propose a comprehensive approach with following key features. (1) It represents inputs, i.e., an image I and a natural language question Qnl as an entity-attribute graph and a query graph, respectively, and employs pattern matching to find answers; (2) it leverages reinforcement learning based model to identify a set of policies that are used to guide visual tasks and construct an entity-attribute graph, based on Qnl; (3) it employs a novel method to parse a question Qnl and generate corresponding query graph Q(uo) for pattern matching; and (4) it integrates inference scheme to further improve result accuracy, in particular, it learns a graph-structured classifier for missing value inference and a co-occurrence matrix for candidate selection. With these features, our approach can not only process visual tasks efficiently, but also answer questions with high accuracy. To evaluate the performance of our approach, we conduct empirical studies on Soccer, Visual-Genome and GQA, and show that our approach outperforms the state-of-the-art methods in result accuracy and system efficiency.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.