Abstract

We propose an approach to advance the Visual Question Answering (VQA) task by incorporating 3D information through multi-view images. Conventional VQA approaches, which answer a linguistic question about a given RGB image in words, have limited ability to recognize geometric information, so they tend to fail at counting objects or inferring positional relationships. Moreover, they cannot reason about occluded space, which makes it infeasible to bring VQA capability to robots operating in highly occluded real-world environments. To address this, we introduce a new multi-view VQA dataset along with an approach that incorporates 3D scene information captured directly from multi-view images into VQA, without using depth images or employing SLAM. Our proposed approach achieves strong performance with an overall accuracy of 95.4% on the challenging multi-view VQA dataset setup, which contains relatively severe occlusion. This work also demonstrates the promise of bridging the gap between 3D vision and language.
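To make the multi-view setting concrete, the sketch below shows one plausible way to pose VQA over several RGB views of the same scene: a shared per-view image encoder, simple pooling across views, a recurrent question encoder, and an answer classifier. The architecture, layer sizes, and class names (MultiViewVQA, view_encoder, question_rnn) are illustrative assumptions for this summary, not the model described in the paper.

```python
# Minimal sketch of a multi-view VQA classifier (assumed architecture,
# not the authors' model): shared CNN per RGB view, mean pooling across
# views, GRU question encoder, and an answer classifier. No depth images
# or SLAM are used, matching the setting described in the abstract.
import torch
import torch.nn as nn


class MultiViewVQA(nn.Module):
    def __init__(self, vocab_size: int, num_answers: int, embed_dim: int = 256):
        super().__init__()
        # Shared image encoder applied to every RGB view.
        self.view_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Question encoder over token ids.
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.question_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Answer classifier over the fused scene/question representation.
        self.classifier = nn.Linear(embed_dim * 2, num_answers)

    def forward(self, views: torch.Tensor, question: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_views, 3, H, W); question: (batch, seq_len) token ids.
        b, v, c, h, w = views.shape
        view_feats = self.view_encoder(views.reshape(b * v, c, h, w)).reshape(b, v, -1)
        scene_feat = view_feats.mean(dim=1)        # aggregate information across views
        _, q_hidden = self.question_rnn(self.word_embed(question))
        fused = torch.cat([scene_feat, q_hidden[-1]], dim=-1)
        return self.classifier(fused)              # logits over candidate answers


# Usage example with random data: 2 scenes, 4 views each, 10-token questions.
model = MultiViewVQA(vocab_size=1000, num_answers=30)
logits = model(torch.randn(2, 4, 3, 64, 64), torch.randint(0, 1000, (2, 10)))
print(logits.shape)  # torch.Size([2, 30])
```

Pooling across views is the simplest way to let the model see around occlusions that hide objects in any single view; attention over views would be a natural refinement.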
