Toward Explainable 3D Grounded Visual Question Answering: A New Benchmark and Strong Baseline

Lichen Zhao, Lu Sheng, Dong Xu, Rui Zheng, Lipeng Wang, Daigang Cai, Xibo Fan, Jing Zhang, Yinjie Zhao

doi:10.1109/tcsvt.2022.3229081

Abstract

Recently, 3D vision-and-language tasks have attracted increasing research interest. Compared to other vision-and-language tasks, the 3D visual question answering (VQA) task is less explored and is more susceptible to language priors and co-reference ambiguity. Meanwhile, a couple of recently proposed 3D VQA datasets do not support the 3D VQA task well because of their limited scale and annotation methods. In this work, we formally define and address a 3D grounded VQA task by collecting a new 3D VQA dataset, referred to as FE-3DGQA, with diverse and relatively free-form question-answer pairs, as well as dense and completely grounded bounding box annotations. To achieve more explainable answers, we label the objects that appear in the complex QA pairs with different semantic types, including answer-grounded objects (both those that appear in the questions and those that do not) and contextual objects for the answer-grounded objects. We also propose a new 3D VQA framework to effectively predict completely visually grounded and explainable answers. Extensive experiments verify that our newly collected benchmark datasets can be used effectively to evaluate various 3D VQA methods from different aspects, and that our newly proposed framework achieves state-of-the-art performance on the new benchmark dataset. The datasets and the source code are available via https://github.com/zlccccc/3DVL_Codebase.
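As a rough illustration of the annotation scheme described in the abstract, a single grounded QA record could be modeled as below. This is only a sketch under stated assumptions: the class names, field names, role identifiers, and the sample values are all hypothetical and are not taken from the FE-3DGQA release or the 3DVL_Codebase repository.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class ObjectRole(Enum):
    """Semantic types named in the abstract (identifiers are hypothetical)."""
    ANSWER_GROUNDED_IN_QUESTION = "answer_grounded_in_question"
    ANSWER_GROUNDED_NOT_IN_QUESTION = "answer_grounded_not_in_question"
    CONTEXTUAL = "contextual"


@dataclass
class Box3D:
    """Axis-aligned 3D box; this center/size layout is an assumption."""
    center: Tuple[float, float, float]
    size: Tuple[float, float, float]


@dataclass
class GroundedObject:
    object_id: int
    label: str
    role: ObjectRole
    box: Box3D


@dataclass
class GroundedQA:
    """One question-answer pair with every related object grounded by a box."""
    scene_id: str
    question: str
    answer: str
    objects: List[GroundedObject] = field(default_factory=list)

    def by_role(self) -> Dict[ObjectRole, List[GroundedObject]]:
        """Group the grounded objects by their semantic type."""
        groups: Dict[ObjectRole, List[GroundedObject]] = {r: [] for r in ObjectRole}
        for obj in self.objects:
            groups[obj.role].append(obj)
        return groups


# Made-up example record, for illustration only.
sample = GroundedQA(
    scene_id="scene_demo",
    question="What color is the chair next to the table?",
    answer="brown",
    objects=[
        GroundedObject(0, "chair", ObjectRole.ANSWER_GROUNDED_IN_QUESTION,
                       Box3D((1.0, 0.5, 0.4), (0.5, 0.5, 0.9))),
        GroundedObject(1, "table", ObjectRole.CONTEXTUAL,
                       Box3D((1.6, 0.5, 0.4), (1.2, 0.8, 0.8))),
    ],
)
groups = sample.by_role()
```

Keeping the role on each object, rather than storing one flat list of boxes, is what would let an evaluation script score answer grounding and contextual grounding separately.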

Journal: IEEE Transactions on Circuits and Systems for Video Technology
Publication Date: Jun 1, 2023
Citations: 7
