Abstract

As a multi-modal task, remote sensing image captioning (RSIC) plays an essential role in image understanding. However, most current RSIC methods do not make sufficient use of image features. We propose an attention-guided visual semantic fusion (AVSF) method for RSIC, built on the encoder-decoder framework. The attention mechanism captures the relationship between the global image and local regions, as well as the relationship between image regions and the corresponding text, which benefits caption generation. First, the convolutional neural network VGG16 is used as the backbone of the encoder to extract features from remote sensing images. Second, image features are fused with text features, and an attention mechanism is introduced to strengthen the focus on visual features. Finally, a long short-term memory (LSTM) network serves as the decoder, learning the mapping between image visual features and text semantic features. Experiments show that the evaluation metrics improve on the UCM and Sydney datasets.

Keywords: Image captioning, Remote sensing, Feature fusion, Attention mechanism
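The abstract outlines a three-stage pipeline: a VGG16 encoder, attention-guided fusion of visual and text features, and an LSTM decoder. The following is a minimal PyTorch sketch of that kind of architecture; the layer sizes, the additive attention form, the concatenation-based fusion, and all module names are illustrative assumptions, since the abstract does not specify them.

```python
# Sketch of an encoder-decoder captioner with attention-guided
# visual-semantic fusion. All hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision.models import vgg16


class Encoder(nn.Module):
    """VGG16 backbone; the final conv feature map serves as region features."""

    def __init__(self):
        super().__init__()
        # weights=None for a self-contained example; a paper-style setup
        # would typically load ImageNet-pretrained weights.
        self.backbone = vgg16(weights=None).features

    def forward(self, images):               # images: (B, 3, 224, 224)
        f = self.backbone(images)            # (B, 512, 7, 7)
        return f.flatten(2).transpose(1, 2)  # (B, 49 regions, 512)


class Attention(nn.Module):
    """Additive attention: scores each image region against the decoder state."""

    def __init__(self, feat_dim, hid_dim, attn_dim=256):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, attn_dim)
        self.w_h = nn.Linear(hid_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (B, R, feat_dim), h: (B, hid_dim)
        scores = self.v(torch.tanh(self.w_v(feats) + self.w_h(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)  # attention weights over regions
        return (alpha * feats).sum(dim=1)     # attended visual context (B, feat_dim)


class Decoder(nn.Module):
    """LSTM decoder that fuses the attended visual context with each word embedding."""

    def __init__(self, vocab_size, emb_dim=256, feat_dim=512, hid_dim=512):
        super().__init__()
        self.hid_dim = hid_dim
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attend = Attention(feat_dim, hid_dim)
        # Fusion by concatenating the word embedding with the visual context.
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, tokens):      # tokens: (B, T) ground-truth word ids
        B, T = tokens.shape
        h = feats.new_zeros(B, self.hid_dim)
        c = feats.new_zeros(B, self.hid_dim)
        logits = []
        for t in range(T):
            ctx = self.attend(feats, h)    # re-attend to the image at every step
            x = torch.cat([self.embed(tokens[:, t]), ctx], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (B, T, vocab_size) word scores
```

Recomputing the attention context at every decoding step lets the decoder shift its focus across image regions as the sentence unfolds, which is the region-to-text relationship the abstract emphasizes.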
