Abstract
Much of the recent work on remote sensing image captioning is influenced by natural image captioning. These methods tend to refine the model architecture to improve on previous work, but pay little attention to the differences between remote sensing images and natural images. By considering these differences, we propose a multiscale multi-interaction remote sensing image captioning model. As shown in Fig. 1(a), targets in remote sensing images span a wide range of scales, whereas natural images are generally taken at close range, so their foreground targets appear at similar scales. Because of this difference in shooting conditions, a model pretrained on close-up natural images cannot capture multiscale remote sensing targets well. To alleviate this problem, we propose a two-stage multiscale structure for feature representation: we first fine-tune the CNN backbone on remote sensing images for domain adaptation, and then collect features from different stages as the multiscale feature representation. Moreover, owing to the long shooting distance, the height information of targets in remote sensing images is greatly weakened, so objects such as low plants and grasses become difficult to identify, as shown in Fig. 1(b). We therefore further propose a multi-interaction feature representation module, in which information from the same and different layers can interact effectively. By computing similarity scores among features, we fuse features with high similarity and increase the distance between features of different categories, thereby enhancing their distinguishability. Results on the RSICD, Sydney-Captions, and UCM-Captions datasets show a clear improvement over the compared methods.
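To make the two ideas in the abstract concrete, the sketch below shows one plausible way to collect features from several backbone stages as a multiscale representation and to fuse them with a simple similarity-weighted interaction. This is an illustrative assumption in PyTorch, not the authors' released implementation; all module and function names (e.g., MultiscaleEncoder, similarity_fusion) are hypothetical.

```python
# Hedged sketch: multiscale feature collection from ResNet stages plus a
# similarity-weighted fusion step. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class MultiscaleEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        cnn = resnet50(weights=None)  # assume the backbone was already fine-tuned on remote sensing images
        self.stem = nn.Sequential(cnn.conv1, cnn.bn1, cnn.relu, cnn.maxpool)
        self.stages = nn.ModuleList([cnn.layer1, cnn.layer2, cnn.layer3, cnn.layer4])
        # 1x1 convs project each stage's channels (256/512/1024/2048) to a common dim
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in (256, 512, 1024, 2048)])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage, proj in zip(self.stages, self.proj):
            x = stage(x)
            # pool every stage to a fixed 7x7 grid and flatten to tokens of shape (B, 49, dim)
            f = F.adaptive_avg_pool2d(proj(x), (7, 7)).flatten(2).transpose(1, 2)
            feats.append(f)
        return torch.cat(feats, dim=1)  # multiscale tokens from all four stages: (B, 4*49, dim)

def similarity_fusion(tokens):
    """Fuse each token with the others, weighted by cosine similarity (residual update)."""
    normed = F.normalize(tokens, dim=-1)
    weights = F.softmax(normed @ normed.transpose(1, 2), dim=-1)  # (B, N, N)
    return tokens + weights @ tokens
```

Under this reading, tokens from different stages with high similarity reinforce each other through the weighted sum, while dissimilar tokens contribute little, which loosely mirrors the fuse-similar / separate-dissimilar behavior described in the abstract.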
Highlights
Image captioning lies at the intersection of natural language processing (NLP) and computer vision (CV) and is used to generate a natural language description for the input image based on its content.
Lu et al. [2] collected a larger dataset for remote sensing image captioning, which is more in line with the characteristics of remote sensing data, and they tested a variety of baseline models borrowed from the natural image captioning task.
As previous work tends to adopt methods similar to those used for natural images to perform remote sensing image captioning (e.g., VGG [14] and ResNet [15] pretrained on natural images are used for image feature extraction, and an LSTM [16] is then applied for text generation), a question arises: is remote sensing image captioning the same as natural image captioning? We argue that some characteristics of remote sensing images should be considered when describing them: 1) backbone networks trained on natural images are generally designed for close-up shots of the target objects, so the network mainly focuses on objects that occupy most of the image.
Summary
Image captioning lies at the intersection of natural language processing (NLP) and computer vision (CV) and is used to generate a natural language description for the input image based on its content. Different from low-level vision tasks, such as remote sensing image classification [4], [5], object/saliency detection [6], [7], and semantic/instance segmentation [8], [9], remote sensing image captioning requires identifying the categories and regions of objects in the image, inferring the relations among them, and expressing them fluently in natural language. Inspired by natural image captioning methods [10]–[12], many recent works [1], [2], [13] adopted the basic encoder-decoder pipeline for remote sensing image captioning; e.g., Qu et al. [1] first proposed to leverage the CNN-RNN architecture to describe remote sensing images. Lu et al. [2] collected a larger dataset for remote sensing image captioning, which is more in line with the characteristics of remote sensing data, and they tested a variety of baseline models borrowed from the natural image captioning task.
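For readers unfamiliar with the encoder-decoder pipeline mentioned above, the following is a minimal CNN-LSTM captioning sketch in PyTorch. It only illustrates the generic architecture family referenced here (CNN encoder, RNN decoder); the class name, dimensions, and training details are assumptions and do not reproduce Qu et al. [1] or Lu et al. [2].

```python
# Minimal CNN-LSTM encoder-decoder captioner (illustrative assumption only).
import torch
import torch.nn as nn
from torchvision.models import resnet50

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = resnet50(weights=None)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])  # drop the fc layer; keep global pooling
        self.img_proj = nn.Linear(2048, embed_dim)                # map image feature to word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # encode the image and prepend it as the first "token" of the caption sequence
        v = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)  # (B, 1, E)
        w = self.embed(captions)                                         # (B, T, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))
        return self.out(h)  # per-step logits over the vocabulary
```

At inference, such a model is typically rolled out token by token (greedy or beam search), feeding each predicted word back into the LSTM until an end-of-sentence token is produced.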