Abstract

Visual grounding is the task of localizing an object described by a sentence in an image. Conventional visual grounding methods extract visual and linguistic features in isolation and then perform cross-modal interaction in a post-fusion manner. We argue that this post-fusion mechanism does not fully exploit the information in the two modalities; instead, it is more desirable to perform cross-modal interaction during the extraction of the visual and linguistic features. In this paper, we propose a language-customized visual feature learning mechanism in which linguistic information guides the extraction of visual features from the very beginning. We instantiate this mechanism as a one-stage framework named Progressive Language-customized Visual feature learning (PLV). PLV consists of a Progressive Language-customized Visual Encoder (PLVE) and a grounding module. At each stage of the PLVE, Channel-wise Language-guided Interaction Modules (CLIM) customize the visual features under linguistic guidance. Without pre-training on object detection datasets, PLV outperforms conventional state-of-the-art methods by large margins across five visual grounding datasets while running at real-time speed. The source code is available in the supplementary material.
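To make the channel-wise customization concrete, below is a minimal sketch of a CLIM-style block, assuming it re-weights the channels of a stage's visual feature map with gates predicted from a pooled sentence embedding. All module names, shapes, and layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ChannelwiseLanguageGuidedInteraction(nn.Module):
    """Hypothetical sketch of a CLIM-style block: the sentence embedding
    predicts per-channel gates that re-weight the visual feature map."""

    def __init__(self, visual_channels: int, text_dim: int):
        super().__init__()
        # Map the sentence embedding to one gating weight per visual channel.
        self.gate = nn.Sequential(
            nn.Linear(text_dim, visual_channels),
            nn.Sigmoid(),
        )

    def forward(self, visual_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W) feature map from one encoder stage
        # text_emb:    (B, D) pooled sentence embedding
        weights = self.gate(text_emb)                  # (B, C)
        weights = weights.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        # Channel-wise modulation: language decides which visual channels to emphasize.
        return visual_feat * weights


if __name__ == "__main__":
    clim = ChannelwiseLanguageGuidedInteraction(visual_channels=256, text_dim=512)
    v = torch.randn(2, 256, 32, 32)  # stage features for a batch of 2 images
    t = torch.randn(2, 512)          # sentence embeddings for the 2 expressions
    print(clim(v, t).shape)          # torch.Size([2, 256, 32, 32])
```

Applying such a block at every encoder stage, rather than fusing modalities only once at the end, is what distinguishes the language-customized extraction described above from post-fusion pipelines.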
