Abstract
This paper proposes a framework that iteratively observes a scene to answer a given question about it. Conventional visual question answering (VQA) methods are designed to answer questions based on single-view images. However, in real-world applications such as human–robot interaction (HRI), in which camera angles and occluded scenes must be considered, answering questions based on single-view images might be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to discuss the VQA task in multi-view settings. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that actively observes a scene until it obtains the scene information necessary to answer a given question. The proposed framework achieves performance comparable to that of a state-of-the-art method in question answering while decreasing the number of required observation viewpoints by a significant margin. Additionally, our framework appears to learn to choose better viewpoints for answering questions, reducing the number of required camera movements. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework shows high accuracy (94.01%) on this unseen real-image dataset.
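To make the active observation idea concrete, the sketch below outlines one way such a loop could look. It is a minimal illustration under stated assumptions, not the paper's implementation: the environment interface (env.observe, env.move_camera), the scene encoder, the viewpoint policy, the answer head, and the stopping threshold are all assumed names. The loop fuses each newly observed view into a scene representation, answers once the prediction is confident enough, and otherwise asks the policy for the next camera move.

```python
# A minimal sketch of an active multi-view VQA loop.
# All names here (env, encoder, policy, answer_head) are illustrative
# assumptions, not interfaces from the paper.

import torch

def active_vqa(env, question, encoder, policy, answer_head,
               max_views=12, stop_threshold=0.9):
    """Observe a scene view by view until the predicted answer is
    confident enough, then return it."""
    state = None                                   # accumulated scene representation
    answer = None
    for _ in range(max_views):
        image = env.observe()                      # image from the current viewpoint
        state = encoder(image, question, state)    # fuse the new view into the state
        probs = torch.softmax(answer_head(state), dim=-1)
        confidence, answer = probs.max(dim=-1)     # current best answer and its score
        if confidence.item() >= stop_threshold:    # stop once the evidence suffices
            break
        env.move_camera(policy(state))             # otherwise pick the next viewpoint
    return answer
```

The early-stopping check is what lets such a framework answer with fewer observation viewpoints than a fixed perimeter sweep; max_views=12 here merely mirrors the 12-viewpoint setting of the CG dataset described in the summary.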
Highlights
Recent developments in deep neural networks have resulted in significant technological advancements and have broadened the applicability of human–robot interaction (HRI)
After fine-tuning, the proposed model achieved an accuracy of 94.01% on the unseen real image dataset, outperforming SRN_FiLM by 11.39%
We proposed a multi-view visual question answering (VQA) framework that actively chooses observation viewpoints to answer questions
Summary
Recent developments in deep neural networks have resulted in significant technological advancements and have broadened the applicability of human–robot interaction (HRI). In real-world environments, because it is challenging to continuously photograph a scene from optimal viewpoints, objects can be heavily occluded, and answering questions based on single-view images can be difficult. Qiu et al. [9] proposed a multi-view VQA framework that uses perimeter-viewpoint observation for answering questions. We built a computer graphics (CG) multi-view VQA dataset with 12 viewpoints. On this dataset, the proposed framework achieved accuracy comparable to that of a state-of-the-art method [9]. We also conducted experiments on a multi-view VQA dataset consisting of real images, which can be used to evaluate the generalization ability of VQA methods. The proposed framework shows high performance on this dataset, indicating its suitability for realistic settings.