Abstract

Referring expression grounding is an important and challenging task in computer vision. To avoid the laborious annotation required by conventional referring grounding, unpaired referring grounding has been introduced, where the training data contains only images and queries without correspondences between them. The few existing solutions to unpaired referring grounding are still preliminary, owing to the challenges of learning vision-language correlation and the lack of top-down guidance with unpaired data. Existing works can only learn vision-language correlation through modality conversion, in which critical information is lost. They also rely heavily on pre-extracted object proposals and thus cannot produce correct predictions when the proposals are defective. In this paper, we propose a novel bidirectional cross-modal matching (BiCM) framework to address these challenges. Specifically, we design a query-aware attention map (QAM) module that introduces a top-down perspective by generating query-specific visual attention maps, avoiding over-reliance on pre-extracted object proposals. A cross-modal object matching (COM) module is further introduced to predict the target objects from a bottom-up perspective. This module exploits the recently emerged pretrained image-text matching model, CLIP, to learn cross-modal correlation without modality conversion. The top-down and bottom-up predictions are then integrated via a similarity fusion (SF) module. We also propose a knowledge adaptation matching (KAM) module that leverages unpaired training data to adapt the pretrained knowledge to the target dataset and task. Experiments show that our framework significantly outperforms previous works on five grounding datasets.
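For illustration only, the sketch below shows how an off-the-shelf CLIP model can score pre-extracted proposal crops against a referring expression without converting between modalities, which is the general idea behind the bottom-up matching direction described above. The function name `rank_proposals` and the inputs `proposal_crops` and `query` are assumptions for this sketch; it is not the paper's COM module.

```python
# Minimal sketch of CLIP-based query-to-region matching (illustrative only,
# not the paper's COM module). Assumes `proposal_crops` is a list of PIL
# image crops of pre-extracted object proposals and `query` is the
# referring expression string.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rank_proposals(proposal_crops, query):
    """Score each proposal crop against the query with CLIP and return
    proposal indices sorted from most to least similar."""
    images = torch.stack([preprocess(c) for c in proposal_crops]).to(device)
    text = clip.tokenize([query]).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(images)   # (N, D)
        txt_feat = model.encode_text(text)      # (1, D)
    # Cosine similarity between each crop embedding and the query embedding.
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(-1)
    return sims.argsort(descending=True).tolist()

# Hypothetical usage:
# crops = [Image.open("crop_0.png"), Image.open("crop_1.png")]
# best_idx = rank_proposals(crops, "the man in the red shirt")[0]
```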
