SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering

Haonan Luo, Yazhou Yao, Guosheng Lin, Zhenmin Tang, Fayao Liu, Zichuan Liu

doi:10.1109/iccv.2019.00976

Abstract

Embodied Question Answering (EQA) is a newly defined research area in which an agent must answer a user's questions by exploring a real-world environment. It has attracted increasing research interest because of its broad applications in autonomous driving systems, in-home robots, and personal assistants. Most existing methods perform poorly in both answering and navigation accuracy because they lack local details and are vulnerable to the ambiguity caused by complicated visual conditions. To tackle these problems, we propose a segmentation-based visual attention mechanism for Embodied Question Answering. First, we extract local semantic features by introducing a novel high-speed video segmentation framework. Then, guided by the extracted semantic features, we propose a bottom-up visual attention mechanism for the Visual Question Answering (VQA) sub-task. Further, we propose a feature fusion strategy to guide the training of the navigator without much additional computational cost. Ablation experiments show that our method improves the accuracy of the VQA module by 4.2 percentage points (68.99% vs. 64.73%) and yields a 3.6 percentage point (48.59% vs. 44.98%) overall improvement in EQA accuracy.
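To make the idea of attention guided by segment-level features concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that each segmented region has already been reduced to a fixed-length feature vector and that the question has been embedded into the same space; the names `attend_over_segments`, `segment_features`, and `question_embedding` are hypothetical, and plain dot-product scoring with a softmax stands in for whatever scoring function the paper actually uses.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_segments(segment_features, question_embedding):
    """Weight per-segment features by their relevance to the question.

    segment_features: list of equal-length float lists, one per segmented
        region (hypothetical input, assumed to come from a segmentation model).
    question_embedding: float list of the same length (hypothetical input).
    Returns (weights, attended), where attended is the attention-weighted
    sum of the segment features.
    """
    # Relevance score of each segment: dot product with the question vector.
    scores = [
        sum(f * q for f, q in zip(feat, question_embedding))
        for feat in segment_features
    ]
    weights = softmax(scores)
    dim = len(question_embedding)
    # Weighted sum of the segment features, one dimension at a time.
    attended = [
        sum(w * feat[i] for w, feat in zip(weights, segment_features))
        for i in range(dim)
    ]
    return weights, attended


# Toy data: three segments with 2-D features and a question that favors
# the first feature dimension.
segments = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
question = [2.0, 0.0]
weights, attended = attend_over_segments(segments, question)
```

In this toy example the first segment aligns best with the question and therefore receives the largest weight, so the attended feature leans toward it. A real system would learn the scoring function and operate on features from a trained network.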

Similar Papers

Depth and Video Segmentation Based Visual Attention for Embodied Question Answering.
Haonan Luo ... Yazhou Yao
IEEE Transactions on Pattern Analysis and Machine Intelligence | VOL. PP
01 Jun 2023

Co-Attending Free-Form Regions and Detections With Multi-Modal Multiplicative Feature Embedding for Visual Question Answering
Pan Lu ... Jianyong Wang
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 32
27 Apr 2018

Accuracy vs. complexity: A trade-off in visual question answering models
Moshiur Farazi ... Nick Barnes
Pattern Recognition | VOL. 120
12 Jun 2021

Image Segmentation Based on Visual Attention Mechanism
Qiaorong Zhang ... Huimin Xiao
Journal of Multimedia | VOL. 4
01 Dec 2009
