Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA

Wentao Mo, Yang Liu

doi:10.1609/aaai.v38i5.28222

Abstract

In 3D Visual Question Answering (3D VQA), the scarcity of fully annotated data and limited visual content diversity hamper generalization to novel scenes and 3D concepts (e.g., only around 800 scenes are utilized in the ScanQA and SQA datasets). Current approaches resort to supplementing 3D reasoning with 2D information. However, these methods face challenges: either they use top-down 2D views that introduce overly complex and sometimes question-irrelevant visual clues, or they rely on globally aggregated scene/image-level representations from 2D VLMs, losing the fine-grained vision-language correlations. To overcome these limitations, our approach utilizes a question-conditional 2D view selection procedure, pinpointing semantically relevant 2D inputs for crucial visual clues. We then integrate this 2D knowledge into the 3D-VQA system via a two-branch Transformer structure. This structure, featuring a Twin-Transformer design, compactly combines 2D and 3D modalities and captures fine-grained correlations between modalities, allowing them to mutually augment each other. Integrating the mechanisms above, we present BridgeQA, which offers a fresh perspective on multi-modal transformer-based architectures for 3D-VQA. Experiments validate that BridgeQA achieves state-of-the-art performance on 3D-VQA datasets and significantly outperforms existing solutions. Code is available at https://github.com/matthewdm0816/BridgeQA.
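
The idea of question-conditional view selection can be illustrated with a minimal sketch. This is not the BridgeQA implementation: the abstract does not say how views are scored, so the cosine-similarity ranking, the function names `cosine_similarity` and `select_views`, the `top_k` parameter, and the toy embedding vectors below are all assumptions made for illustration. In practice the embeddings would presumably come from a 2D vision-language model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_views(
    question_emb: Sequence[float],
    view_embs: Sequence[Sequence[float]],
    top_k: int = 1,
) -> List[int]:
    """Return indices of the top_k views most similar to the question.

    Hypothetical scoring rule: rank candidate 2D views by cosine similarity
    between the question embedding and each view embedding.
    """
    scores = [cosine_similarity(question_emb, v) for v in view_embs]
    ranked = sorted(range(len(view_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy embeddings (made up): view 1 points in nearly the same direction
# as the question, view 2 partially, view 0 barely.
question = [0.9, 0.1, 0.0]
views = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
selected = select_views(question, views, top_k=2)
print(selected)  # [1, 2]
```

Only the selected views would then be passed on to the 2D branch, which is what keeps question-irrelevant visual clues out of the fusion step.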


Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 1

