FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection

Dongmei Zhang, Shanghang Zhang, Wei Xue, Shenghao Xie, Chang Li, Xiaodong Xie, Renrui Zhang

doi:10.1609/aaai.v38i15.29612

Abstract

The superior performance of pre-trained foundation models on various visual tasks underscores their potential to enhance the open-vocabulary ability of 2D models. Existing methods explore analogous applications in 3D space. However, most of them center on extracting knowledge from a single foundation model, which limits the open-vocabulary ability of 3D models. We hypothesize that leveraging complementary pre-trained knowledge from multiple foundation models can improve knowledge transfer from 2D pre-trained visual language models to 3D space. In this work, we propose FM-OV3D, a method of Foundation Model-based Cross-modal Knowledge Blending for Open-Vocabulary 3D Detection, which improves the open-vocabulary localization and recognition abilities of a 3D model by blending knowledge from multiple pre-trained foundation models, achieving true open-vocabulary detection without being constrained by the original 3D datasets. Specifically, to learn open-vocabulary 3D localization, we adopt the open-vocabulary localization knowledge of the Grounded-Segment-Anything model. For open-vocabulary 3D recognition, we leverage the knowledge of generative foundation models, including GPT-3 and Stable Diffusion, and of cross-modal discriminative models such as CLIP. Experimental results on two popular benchmarks for open-vocabulary 3D object detection show that our model efficiently learns knowledge from multiple foundation models to enhance the open-vocabulary ability of the 3D model and achieves state-of-the-art performance on open-vocabulary 3D object detection tasks. Code is released at https://github.com/dmzhang0425/FM-OV3D.git.
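
The recognition step mentioned in the abstract relies on cross-modal models such as CLIP, which compare a visual feature against text embeddings of candidate class names. The sketch below illustrates only that general idea and is not the authors' implementation: the function names (`cosine`, `classify_open_vocab`), the temperature value, and the three-dimensional toy embeddings are all assumptions made for illustration. In a real pipeline the vectors would come from a 3D detector and a text encoder.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_open_vocab(
    object_feature: List[float],
    text_embeddings: Dict[str, List[float]],
    temperature: float = 0.07,  # assumed value, chosen only for illustration
) -> Tuple[str, float]:
    """Pick the class name whose text embedding best matches the object feature.

    The label set is just the keys of `text_embeddings`, so new class names
    can be added at inference time without retraining anything.
    """
    labels = list(text_embeddings)
    logits = [cosine(object_feature, text_embeddings[name]) / temperature
              for name in labels]
    # Numerically stable softmax over the similarity logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs[best]


# Toy embeddings (hypothetical values, not produced by any real encoder).
toy_text = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "sofa": [0.0, 0.0, 1.0],
}
label, prob = classify_open_vocab([0.9, 0.1, 0.0], toy_text)
print(label, round(prob, 3))
```

Because the vocabulary is supplied as data rather than fixed in a classifier head, extending it amounts to adding another entry to `toy_text`, which is the property that makes this style of recognition open-vocabulary.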

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 3
