Abstract

Visual grounding (VG) locates target objects in visual scenes by understanding given natural language queries. Current VG methods mainly focus on grounding referring expressions or noun phrases covered by the labeled training samples. Despite their grounding prowess, these approaches struggle to ground novel query-image pairs excluded from the training data. This shortcoming usually stems from insufficiently discriminative representation learning in both images and queries. To address these issues, we propose a one-stage coarse-to-fine framework for zero-shot VG that grounds novel query-image samples. Specifically, in the coarse stage, we mine global context information from the visual features and query embeddings with a multi-head self-attention block, strengthening the intra-modality relations within the visual and textual features. In the fine stage, we first learn query-aware visual representations from the acquired global context information via a multi-modal relation-enhanced transformer block, which captures informative cues by modeling cross-modal interactions between the visual and textual domains. We then extract target-oriented discriminative representations from the query-aware visual representations with a noun phrase-guided multi-modal interaction network, which strengthens the interaction between target-related phrases and the query-aware visual representations to make target regions more distinctive, benefiting both grounding of the referred target and generalization. To validate the proposed approach, we conduct extensive experiments and ablation studies on public benchmark datasets, including RefCOCO, RefCOCO+, RefCOCOg, Flickr30K Entities, Flickr-Split-0, and Flickr-Split-1. Experimental results demonstrate that our approach substantially improves grounding accuracy and achieves new state-of-the-art performance under single-stage training and testing.
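To make the coarse-to-fine flow described above concrete, the following is a minimal PyTorch sketch of the pipeline: intra-modality self-attention in the coarse stage, followed by cross-modal interaction and noun phrase-guided refinement in the fine stage, ending in box regression. All module names, dimensions, the pooling step, and the box-regression head are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CoarseToFineGrounder(nn.Module):
    """Illustrative sketch of the coarse-to-fine zero-shot grounding pipeline.
    Hyperparameters and heads are assumptions for demonstration only."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # Coarse stage: intra-modality self-attention for visual and textual features.
        self.visual_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Fine stage (1): cross-modal interaction yielding query-aware visual features.
        self.cross_modal = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        # Fine stage (2): noun phrase-guided interaction sharpening target regions.
        self.phrase_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Grounding head (assumed): regress a single normalized box (cx, cy, w, h).
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4)
        )

    def forward(self, visual_feats, query_feats, phrase_feats):
        # visual_feats: (B, N_regions, d); query_feats: (B, N_tokens, d);
        # phrase_feats: (B, N_phrases, d) embeddings of target-related noun phrases.
        v, _ = self.visual_self_attn(visual_feats, visual_feats, visual_feats)
        q, _ = self.text_self_attn(query_feats, query_feats, query_feats)
        # Query-aware visual representations via cross-modal attention.
        qav = self.cross_modal(tgt=v, memory=q)
        # Augment interaction between noun phrases and query-aware visual features.
        refined, _ = self.phrase_attn(qav, phrase_feats, phrase_feats)
        # Pool over regions and regress the referred target box.
        return self.box_head(refined.mean(dim=1)).sigmoid()

# Usage with random tensors standing in for backbone visual/text features.
model = CoarseToFineGrounder()
box = model(torch.randn(2, 196, 256), torch.randn(2, 20, 256), torch.randn(2, 3, 256))
print(box.shape)  # torch.Size([2, 4])
```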

