Abstract

3D visual grounding is the task of accurately locating objects in a three-dimensional scene based on textual descriptions. Current approaches mainly rely on downsampling point clouds and extracting point features for fusion with text features. However, these methods suffer from poor point-feature resolution and limited local context during multi-modal fusion, which causes visual-linguistic misalignment, particularly for small objects described in the text. An intuitive solution is to gather additional object-related point features and richer contextual information, thereby strengthening the representational capability of the multimodal features. Based on this, we introduce a novel 3D visual grounding framework named Context-aware Feature Aggregation (CFA). The CFA framework includes two key modules: (1) the Point Augmented Aggregation Module (PAM), designed to compensate for downsampling-induced information loss by augmenting sampled points with neighboring context to yield more discriminative features; and (2) the Dual Contextual Grouping Attention Module (DCGAM), which iteratively refines the features and geometric coordinates produced by PAM to capture broader global context. We evaluate CFA on two point-based datasets, ScanRefer and Nr3D/Sr3D; experimental results show that it is effective for 3D visual grounding and outperforms previous methods.
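To make the neighborhood-augmentation idea behind PAM concrete, below is a minimal sketch assuming a standard k-nearest-neighbor grouping with max pooling; the function name, tensor shapes, and pooling choice are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def augment_sampled_points(sampled_xyz, sampled_feats, all_xyz, all_feats, k=16):
    """Illustrative sketch (not the paper's PAM): for each downsampled point,
    gather features of its k nearest neighbors in the full point cloud, pool
    them into a local-context vector, and fuse it with the point's own feature.
    sampled_xyz: (M, 3), sampled_feats: (M, C), all_xyz: (N, 3), all_feats: (N, C)
    """
    dists = torch.cdist(sampled_xyz, all_xyz)           # (M, N) pairwise distances
    idx = dists.topk(k, largest=False).indices          # (M, k) nearest-neighbor indices
    neighbor_feats = all_feats[idx]                      # (M, k, C) grouped neighbor features
    context = neighbor_feats.max(dim=1).values           # (M, C) pooled local context
    return torch.cat([sampled_feats, context], dim=-1)   # (M, 2C) context-augmented features

# Example usage with random data standing in for a point cloud backbone's output.
all_xyz, all_feats = torch.rand(4096, 3), torch.rand(4096, 128)
sel = torch.randperm(4096)[:1024]                        # indices of downsampled points
augmented = augment_sampled_points(all_xyz[sel], all_feats[sel], all_xyz, all_feats)
```

The point of the sketch is only to show how neighboring context can offset downsampling loss before multi-modal fusion; the paper's PAM and the subsequent DCGAM refinement may differ in grouping, pooling, and fusion details.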
