Abstract

3D Visual Grounding (3DVG) is a widely studied multi-modal information fusion task that aims to localize, within a point cloud scene, the target object referred to by a natural language description. However, its stringent requirements on input and output devices substantially hinder the application and integration of 3DVG in fields such as remote robotic control and telemedicine. To address this challenge, we make three contributions. First, we introduce a novel multi-modal task, 3D Visual Grounding-Audio (3DVG-Audio), based on the fusion of audio and point clouds. To the best of our knowledge, this is the first Audio-Point Cloud multi-modal task. Given a point cloud and the corresponding audio input, 3DVG-Audio localizes the objects mentioned in the audio within the point cloud. Second, building on ScanRefer, we construct a new dataset, 3DVG-AudioSet, designed specifically for training and evaluating 3DVG-Audio methods. Finally, we design a tailored loss function to further improve 3DVG-Audio performance and propose a method named AP-Refer, which serves as a benchmark for the task. Extensive experiments demonstrate the potential of deeply integrating audio and point clouds to tackle complex real-world problems. AP-Refer addresses the 3DVG-Audio task, circumvents the limitations of conventional 3DVG methods, and shows significant application potential.
