Abstract

Visual grounding (VG) is the task of locating a specific region in an image according to a natural language expression. Existing approaches to VG fall into two-stage, one-stage, and Transformer-based methods, which have achieved strong performance. However, most previous methods extract visual information at only a single spatial scale and ignore the others, preventing these models from fully exploiting the available visual information. Moreover, insufficient use of linguistic information, especially the failure to capture global linguistic information, can prevent a full understanding of the language expression and thus limits model performance. To better address the task, we propose a language-conditioned multi-scale visual attention network (LMSVA) for visual grounding, which makes full use of both visual and linguistic information for multimodal reasoning and thereby improves performance. Specifically, we design a visual feature extractor containing a multi-scale layer that expands the original backbone to produce the required multi-scale visual features. In addition, we pool the output of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract sentence-level linguistic features, enabling the model to capture global linguistic information. Inspired by the Transformer architecture, we present the Visual Attention Layer guided by Language and Multi-Scale Visual Features (VALMS), which better learns visual context under the guidance of multi-scale visual and linguistic features and facilitates further multimodal reasoning. Extensive experiments on four large benchmark datasets, ReferItGame, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate that our proposed model achieves state-of-the-art performance.
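The abstract mentions pooling BERT's token-level output into a single sentence-level feature vector that carries global linguistic information. The exact pooling scheme is not specified here; the sketch below shows masked mean-pooling, a common choice, using plain NumPy so the idea is self-contained (the function name and toy shapes are illustrative, not the authors' implementation):

```python
import numpy as np

def pool_sentence_features(token_feats, mask):
    """Masked mean-pooling: collapse per-token features (B, T, D) into one
    sentence-level vector (B, D), ignoring padding positions where mask == 0."""
    mask = mask[:, :, None].astype(token_feats.dtype)  # (B, T, 1) broadcastable
    summed = (token_feats * mask).sum(axis=1)          # sum over real tokens
    counts = np.clip(mask.sum(axis=1), 1.0, None)      # guard against empty rows
    return summed / counts

# Toy example: batch of 2 expressions, 4 token slots, 3-dim features
feats = np.arange(24, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # expression 1 has 3 real tokens
                 [1, 1, 0, 0]])  # expression 2 has 2 real tokens
sent = pool_sentence_features(feats, mask)
print(sent.shape)  # (2, 3)
```

In practice the token features would come from BERT's last hidden states, and the padding mask from the tokenizer's attention mask; mean-pooling over real tokens gives each word equal weight in the global sentence representation.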
