Abstract
3D visual grounding is the task of accurately locating objects in a three-dimensional scene based on textual descriptions. Current approaches mainly depend on downsampling the point cloud and extracting point features for fusion with text features. However, these methods suffer from poor point feature resolution and limited local context during multi-modal fusion, causing visual-linguistic misalignment, particularly for small objects described in the text. An intuitive solution is to gather additional object-related point features and richer contextual information, thereby strengthening the representation capability of the multimodal features. Based on this, we introduce a novel 3D visual grounding framework named Context-aware Feature Aggregation (CFA). The CFA framework includes two key modules: (1) a Point Augmented Aggregation Module (PAM), designed to compensate for downsampling-induced information loss by augmenting sampled points with neighboring context to produce more discriminative features; and (2) a Dual Contextual Grouping Attention Module (DCGAM), which iteratively refines the features and geometric coordinates from PAM to capture more global context. We evaluate our CFA framework on two point-based datasets, ScanRefer and Nr3D/Sr3D. Experimental results show that CFA is effective for 3D visual grounding and surpasses previous methods.
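The abstract does not give implementation details for PAM. As a rough illustration of the kind of neighborhood aggregation it describes (augmenting each downsampled point with context pooled from nearby points), the sketch below uses generic k-nearest-neighbor grouping, max-pooling of neighbor features, and relative coordinates. All function names, shapes, and pooling choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def knn_indices(points, queries, k):
    """For each query point, return indices of its k nearest neighbors in `points`."""
    # Pairwise squared distances: (num_queries, num_points)
    d2 = ((queries[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]

def aggregate_neighborhood_features(points, feats, sampled_idx, k=16):
    """Augment each sampled point's feature with context pooled from its k neighbors.

    points:      (N, 3) xyz coordinates of the full point set
    feats:       (N, C) per-point features
    sampled_idx: (M,)   indices of the downsampled (kept) points
    Returns:     (M, C + 3) context-aware features for the sampled points
    """
    queries = points[sampled_idx]                  # (M, 3)
    nbr = knn_indices(points, queries, k)          # (M, k)

    nbr_feats = feats[nbr]                         # (M, k, C)
    nbr_xyz = points[nbr] - queries[:, None, :]    # (M, k, 3) relative coordinates

    # Pool neighbor features and relative geometry, then concatenate.
    pooled_feat = nbr_feats.max(axis=1)            # (M, C)
    pooled_xyz = nbr_xyz.mean(axis=1)              # (M, 3)
    return np.concatenate([pooled_feat, pooled_xyz], axis=1)

# Toy usage: 1024 points with 32-dim features, 256 sampled points.
pts = np.random.rand(1024, 3).astype(np.float32)
f = np.random.rand(1024, 32).astype(np.float32)
kept = np.random.choice(1024, 256, replace=False)
out = aggregate_neighborhood_features(pts, f, kept, k=16)
print(out.shape)  # (256, 35)
```

In a full pipeline, the aggregated features would typically be passed through a learned projection before multi-modal fusion with the text features; the pooling-and-concatenation step here only sketches the context-gathering idea.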