GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection

Haozhan Shen, Jianwei Yin, Tiancheng Zhao, Mingwei Zhu

doi:10.1609/aaai.v38i5.28278

Abstract

Visual grounding, a crucial vision-language task that involves understanding the visual context based on a query expression, requires the model to capture the interactions between objects as well as various spatial and attribute information. However, annotated data for visual grounding is limited because the annotation process is time-consuming and labor-intensive, which keeps trained models from generalizing to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from existing models trained on image-text pairs and pure object detection data, both of which are easier to obtain and cover a broader domain than visual grounding annotation data. GroundVLP uses a fusion mechanism that combines the heatmap from GradCAM with the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on the RefCOCO/+/g datasets, surpassing the prior zero-shot state of the art by approximately 28% on the test splits of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k Entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP.
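The fusion idea mentioned in the abstract, combining a GradCAM heatmap with detector proposals, can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything below is an assumption made for illustration: the function names `score_box` and `ground`, the area exponent `alpha`, and the rule "sum of heatmap activation inside the box divided by area raised to `alpha`" are hypothetical and may differ from what the GroundVLP code actually does. The heatmap and boxes are toy values, not outputs of any real model.

```python
from typing import List, Sequence, Tuple

# (x1, y1, x2, y2) in pixel coordinates; x2 and y2 are exclusive.
Box = Tuple[int, int, int, int]


def score_box(heatmap: Sequence[Sequence[float]], box: Box, alpha: float = 0.5) -> float:
    """Hypothetical fusion score: heatmap mass inside the box over area**alpha.

    alpha = 0 uses the raw sum (favors large boxes);
    alpha = 1 uses the mean activation (favors small, tight boxes).
    """
    x1, y1, x2, y2 = box
    area = max((x2 - x1) * (y2 - y1), 1)
    total = sum(sum(row[x1:x2]) for row in heatmap[y1:y2])
    return total / (area ** alpha)


def ground(heatmap: Sequence[Sequence[float]], proposals: List[Box], alpha: float = 0.5) -> Box:
    """Return the proposal with the highest score under the assumed rule."""
    if not proposals:
        raise ValueError("no proposals to choose from")
    return max(proposals, key=lambda b: score_box(heatmap, b, alpha))


# Toy 4x4 heatmap with activation concentrated in the top-left 2x2 corner.
heatmap = [
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.1],
]
# Toy proposals standing in for an open-vocabulary detector's output.
proposals = [(0, 0, 2, 2), (2, 2, 4, 4), (0, 0, 4, 4)]

best = ground(heatmap, proposals)
print(best)
```

In this toy setup the area normalization is what stops the full-image box from winning simply because it covers all of the activation; setting `alpha=0.0` removes that penalty and the largest box is selected instead.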

Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Publication Date: Mar 24, 2024
Citations: 1
