Abstract
In this paper, we introduce the task of visual grounding for remote sensing data (RSVG). RSVG aims to localize the objects referred to by natural language expressions in remote sensing (RS) images. To retrieve rich information from RS imagery with natural language, many research tasks, such as RS image visual question answering, RS image captioning, and RS image-text retrieval, have been extensively investigated. However, object-level visual grounding on RS images remains under-explored. In this work, we therefore construct a benchmark dataset and explore deep learning models for the RSVG task. Our contributions can be summarized as follows. 1) We build a new large-scale benchmark dataset for RSVG, termed DIOR-RSVG, to advance research on this task. The dataset contains image/expression/box triplets for training and evaluating visual grounding models. 2) We benchmark extensive state-of-the-art (SOTA) natural-image visual grounding methods on DIOR-RSVG and provide insightful analyses based on the results. 3) We propose a novel transformer-based Multi-Granularity Visual Language Fusion (MGVLF) module. RS images typically exhibit large scale variations and cluttered backgrounds. To handle scale variation, the MGVLF module leverages multi-scale visual features and multi-granularity textual embeddings to learn more discriminative representations. To cope with cluttered backgrounds, it adaptively filters irrelevant noise and enhances salient features. In this way, the proposed model incorporates more effective multi-level and multi-modal features to boost performance. This work can provide useful insights for developing better RSVG models.
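To make the fusion idea concrete, below is a minimal sketch in Python/PyTorch of how a transformer could jointly attend over multi-scale visual tokens and multi-granularity text tokens. This is an illustrative assumption, not the paper's MGVLF implementation: the class name MGVLFSketch, the dimensions, and the simple concatenate-then-encode strategy are hypothetical.

```python
import torch
import torch.nn as nn

class MGVLFSketch(nn.Module):
    """Hypothetical multi-granularity visual-language fusion block.

    Flattens multi-scale visual feature maps into tokens, concatenates them
    with word-level (fine-grained) and sentence-level (coarse) text
    embeddings, and lets a transformer encoder attend across all tokens so
    that language-relevant visual features can be enhanced and background
    clutter suppressed. Not the authors' architecture; a sketch only.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, visual_feats, word_embeds, sent_embed):
        # visual_feats: list of (B, d, H_i, W_i) maps from different backbone stages
        # word_embeds:  (B, L, d) token-level text features
        # sent_embed:   (B, d)    pooled sentence-level text feature
        tokens = [f.flatten(2).transpose(1, 2) for f in visual_feats]  # each (B, H_i*W_i, d)
        seq = torch.cat(tokens + [word_embeds, sent_embed.unsqueeze(1)], dim=1)
        return self.encoder(seq)  # fused multi-modal tokens, (B, N, d)

# Toy usage with random inputs.
if __name__ == "__main__":
    B, d = 2, 256
    feats = [torch.randn(B, d, s, s) for s in (32, 16, 8)]  # three pyramid levels
    words = torch.randn(B, 20, d)
    sent = torch.randn(B, d)
    fused = MGVLFSketch(d_model=d)(feats, words, sent)
    print(fused.shape)  # (2, 32*32 + 16*16 + 8*8 + 20 + 1, 256)
```

Concatenating all granularities into one sequence is the simplest design that lets self-attention weight each visual token by its relevance to both individual words and the whole expression; the actual MGVLF module may differ in how it filters noise and fuses scales.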