Abstract

Phrase grounding aims to map textual phrases to their associated image regions, which can be a prerequisite for multimodal reasoning and can benefit tasks that require identifying objects based on language. Although pre-trained vision-and-language models achieve impressive performance across tasks, it remains unclear whether we can directly utilize their learned embeddings for phrase grounding without fine-tuning. To this end, we propose a method to extract matched phrase-region pairs from pre-trained vision-and-language embeddings and propose four fine-tuning objectives that improve the model's phrase grounding ability using image-caption data without any supervised grounding signals. Experiments on two representative datasets demonstrate the effectiveness of our objectives, outperforming baseline models in both weakly-supervised and supervised phrase grounding settings. In addition, we evaluate the aligned embeddings on several other downstream tasks and show that we can achieve better phrase grounding without sacrificing representation generality.

Highlights

  • In this paper, we study the phrase grounding ability of vision-and-language embeddings pre-trained on image-caption datasets

  • Few existing papers have paid attention to the phrase grounding ability of their pre-trained embeddings, namely the ability to map natural language queries to their corresponding image regions

  • We propose objectives for fine-tuning that improve phrase grounding while maintaining representation transferability, so that the learned representations remain useful for other downstream tasks

  • We fine-tune models with 1) a masked language modeling objective conditioned on images; 2) an adapted masked region modeling objective conditioned on texts that utilizes a dynamically constructed vision vocabulary; 3) a modified object label prediction objective that explicitly bridges the gap between vision and language; and 4) a bidirectional attention optimization objective encouraging consistency between vision-to-language and language-to-vision alignments (see the sketch below)
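
The highlights do not spell out the loss formulations, so the bidirectional attention objective in 4) can only be illustrated with a minimal sketch. The function name `bidirectional_attention_loss`, the scaled dot-product similarity, and the symmetric KL penalty below are assumptions for illustration; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

def bidirectional_attention_loss(text_emb: torch.Tensor,
                                 region_emb: torch.Tensor) -> torch.Tensor:
    """Encourage language-to-vision and vision-to-language alignments to agree.

    text_emb:   (num_tokens, dim)  token embeddings from the model
    region_emb: (num_regions, dim) region-of-interest embeddings from the model
    """
    scale = text_emb.size(-1) ** 0.5
    sim = text_emb @ region_emb.t() / scale  # (num_tokens, num_regions)

    # Language-to-vision: each token attends over the image regions.
    t2v = F.softmax(sim, dim=-1)
    # Vision-to-language: each region attends over the caption tokens.
    v2t = F.softmax(sim.t(), dim=-1)

    # Penalize disagreement between the two alignment directions with a
    # symmetric KL divergence (one simple choice of consistency penalty).
    kl_1 = F.kl_div(t2v.log(), v2t.t(), reduction="batchmean")
    kl_2 = F.kl_div(v2t.t().log(), t2v, reduction="batchmean")
    return 0.5 * (kl_1 + kl_2)
```

In practice, a consistency term of this kind would be added to the other fine-tuning objectives with a weighting coefficient.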


Summary

Extracting Phrase-Region Pairs from Pre-Trained Embeddings

We first propose a way to directly extract matched phrase-region pairs from pre-trained embeddings. We evaluate this method on phrase grounding tasks with several popular pre-trained models, including LXMERT (Tan and Bansal, 2019), UNITER (Chen et al., 2020), ViLBERT (Lu et al., 2019), VisualBERT (Li et al., 2019), and VL-BERT (Su et al., 2019). All the vision-and-language models are pre-trained on a pruned Conceptual Captions dataset (Sharma et al., 2018), consisting of 2.77M images with weakly-associated captions automatically collected from billions of web pages. The image features are extracted using a Faster R-CNN (Ren et al., 2016) with a ResNet-101 backbone (Anderson et al., 2018) trained on the Visual Genome dataset (Krishna et al., 2017), and the vision-and-language models are trained with 36 extracted regions of interest per image.
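
The extraction rule is only described at a high level here, so the following is a minimal sketch of one way to read matched phrase-region pairs off the pre-trained embeddings: mean-pool the contextualized token embeddings inside each phrase span and pick the region with the highest cosine similarity. The function name, the mean-pooling step, and the cosine-similarity scoring are illustrative assumptions; the paper's actual extraction method (e.g., one based on cross-attention weights) may differ.

```python
import torch
import torch.nn.functional as F

def extract_phrase_region_pairs(token_emb: torch.Tensor,
                                region_emb: torch.Tensor,
                                phrase_spans: list) -> list:
    """Match each phrase to the region whose embedding is most similar.

    token_emb:    (num_tokens, dim)  contextualized token embeddings
    region_emb:   (num_regions, dim) embeddings of the detected regions (36 here)
    phrase_spans: list of (start, end) token-index spans, one per phrase
    Returns the index of the best-matching region for each phrase.
    """
    matches = []
    for start, end in phrase_spans:
        # Mean-pool the token embeddings inside the phrase span.
        phrase_vec = token_emb[start:end].mean(dim=0, keepdim=True)
        # Cosine similarity between the phrase vector and every region.
        sims = F.cosine_similarity(phrase_vec, region_emb, dim=-1)
        matches.append(int(sims.argmax()))
    return matches
```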

Results
Aligning Pre-trained Vision-and-Language Embeddings
Fine-tuning Objectives
Experiments
Analysis
Conclusion
Appendix C: Phrase Grounding Abilities Across Layers

