3D Visual Grounding-Audio: 3D scene object detection based on audio

Can Zhang,Zeyu Cai,Xunhao Chen,Feipeng Da,Shaoyan Gai

doi:10.1016/j.neucom.2024.128637

Abstract

3D Visual Grounding (3DVG) is a widely studied multi-modal information fusion task that localizes, within a point cloud scene, the target objects referenced in natural language descriptions. However, its stringent demands on input and output devices present substantial hurdles for applying and integrating 3DVG in fields such as remote robotic control and telemedicine. To address this challenge, we make three contributions. First, we introduce a novel multi-modal task, termed 3D Visual Grounding-Audio (3DVG-Audio), which is based on the fusion of audio and point cloud data. To the best of our knowledge, this is the first audio-point cloud multi-modal task. 3DVG-Audio localizes the objects mentioned in an audio input within the corresponding point cloud. Second, building upon ScanRefer, we develop a new dataset named 3DVG-AudioSet, designed specifically for training and evaluating 3DVG-Audio methods. Finally, we design a tailored loss function to further improve 3DVG-Audio performance and introduce a method named AP-Refer, which serves as a benchmark for the task. Extensive experimental results demonstrate the potential of deeply integrating audio and point cloud data to tackle complex real-world challenges. AP-Refer successfully addresses the 3DVG-Audio task, circumvents the limitations of conventional 3DVG methods, and exhibits significant application potential.

More From: Neurocomputing

Similar Papers

Scene point cloud understanding and reconstruction technologies in 3D space
Jingyu Gong ... Xin Tan
Journal of Image and Graphics | VOL. 28
01 Jan 2023

MIA-Net: Multi-Modal Interactive Attention Network for Multi-Modal Affective Analysis
Shuzhen Li ... C L Philip Chen
IEEE Transactions on Affective Computing | VOL. 14
01 Oct 2023

Density-Imbalance-Eased LiDAR Point Cloud Upsampling via Feature Consistency Learning
Tso-Yuan Chen ... Ching-Chun Huang
IEEE Transactions on Intelligent Vehicles | VOL. 8
01 Apr 2023

A New Semantic Segmentation Method of Point Cloud Based on PointNet and VoxelNet
Weihang Zhou ... Junguo Lu
01 Jun 2019
