Abstract

This paper proposes a framework that observes a scene iteratively to answer a given question about the scene. Conventional visual question answering (VQA) methods are designed to answer questions from single-view images. However, in real-world applications such as human–robot interaction (HRI), where camera angles and occlusions must be considered, answering questions from a single view can be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to study the VQA task in a multi-view setting. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that actively observes a scene until it has gathered the information necessary to answer the given question. The proposed framework achieves question-answering performance comparable to that of a state-of-the-art method while reducing the number of required observation viewpoints by a significant margin. Additionally, we found that our framework plausibly learned to choose more informative viewpoints for answering questions, lowering the number of camera movements required. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework achieves high accuracy (94.01%) on this unseen real-image dataset.

Highlights

  • Recent developments in deep neural networks have resulted in significant technological advancements and have broadened the applicability of human–robot interaction (HRI)

  • After fine-tuning, the proposed model achieved an accuracy of 94.01% on the unseen real image dataset, outperforming SRN_FiLM by 11.39%

  • We proposed a multi-view visual question answering (VQA) framework that actively chooses observation viewpoints to answer questions

Summary

Introduction

Recent developments in deep neural networks have led to significant technological advances and have broadened the applicability of human–robot interaction (HRI). In real-world environments, because it is challenging to continuously photograph a scene from optimal viewpoints, objects can be heavily occluded, and answering questions from single-view images can be difficult. Qiu et al. [9] proposed a multi-view VQA framework that observes a scene from perimeter viewpoints to answer questions. We built a computer graphics (CG) multi-view VQA dataset with 12 viewpoints; on this dataset, the proposed framework achieved accuracy comparable to that of a state-of-the-art method [9]. We also conduct experiments on a multi-view VQA dataset consisting of real images, which can be used to evaluate the generalization ability of VQA methods. The proposed framework performs well on this dataset, indicating its suitability for realistic settings.
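
As described above, the framework observes a scene iteratively: it updates a learned scene representation after each observation, attempts to answer, and, if the information is still insufficient, selects the next viewpoint (the section outline below lists a learned scene representation, deep Q-learning networks, and viewpoint selection as components). The following Python sketch illustrates such an active observation loop under stated assumptions; the objects passed in (`encoder` for the scene-representation network, `answerer` for a FiLM-style question-answering head, `policy` for a viewpoint-selection Q-network), the `capture_view` interface, and the confidence-threshold stopping rule are hypothetical placeholders, not the authors' actual implementation.

    import torch

    NUM_VIEWPOINTS = 12          # perimeter viewpoints, as in the CG dataset described above
    CONFIDENCE_THRESHOLD = 0.9   # assumed stopping criterion (not specified in this summary)

    def answer_with_active_observation(question, scene, encoder, answerer, policy):
        """Observe the scene viewpoint by viewpoint until the answer is confident enough.

        `encoder`, `answerer`, and `policy` stand in for a learned scene-representation
        network, a question-answering head, and a viewpoint-selection Q-network.
        """
        state = encoder.initial_state()                   # empty scene representation
        viewpoint = 0                                     # start from an arbitrary viewpoint
        answer = None
        for _ in range(NUM_VIEWPOINTS):
            image = scene.capture_view(viewpoint)         # observe the current viewpoint
            state = encoder.update(state, image, viewpoint)   # fuse the new observation

            # Try to answer from the current scene representation.
            probs = torch.softmax(answerer(state, question), dim=-1)
            confidence, answer = probs.max(dim=-1)
            if confidence.item() >= CONFIDENCE_THRESHOLD:
                break                                     # enough information gathered

            # Otherwise, let the Q-network pick the next viewpoint to visit.
            q_values = policy(state, question)            # one value per candidate viewpoint
            viewpoint = int(q_values.argmax())            # move the camera to the best next view

        return answer.item()

Stopping as soon as the answer is sufficiently confident, rather than always scanning all 12 perimeter viewpoints, is what allows the framework to reduce the number of required observations and camera movements.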

Visual Question Answering
Learned Scene Representation
Deep Q-Learning Networks
Embodied Question Answering
Approach
Scene Representation
Viewpoint Selection
Implementation Details
Experiments with CG Images
Method
Training on CG Images and Testing on the Real Images Dataset
Fine-Tuning on the Semi-CG Dataset
Conclusions
