Abstract

Significant progress has been made in remote sensing image captioning by encoder-decoder frameworks. The conventional attention mechanism is prevalent in this task but still has drawbacks: it uses only visual information about the remote sensing images, without exploiting label information to guide the calculation of attention masks. To this end, a novel attention mechanism, namely the Label-Attention Mechanism (LAM), is proposed in this paper. LAM additionally utilizes the label information of high-resolution remote sensing images to generate natural sentences describing the given images. It is worth noting that, instead of high-level image features, the word embedding vectors of the predicted categories are adopted to guide the calculation of attention masks. Representing the content of images as word embedding vectors filters out redundant image features while preserving pure, useful information for generating complete sentences. Experimental results on UCM-Captions, Sydney-Captions and RSICD demonstrate that LAM improves the model's performance in describing high-resolution remote sensing images and obtains better S_m scores than other methods, where S_m is a hybrid scoring method derived from the AI Challenge 2017 scoring method. In addition, the validity of LAM is verified by an experiment using ground-truth labels.
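To make the idea concrete, the label-guided attention step described above can be sketched as follows. This is a minimal illustration, not the authors' exact formulation: the shapes, projection matrices (`W_v`, `W_l`, `w`), and the additive tanh scoring function are assumptions. The key point it demonstrates is that the attention mask over image regions is computed from the predicted category's word embedding rather than from high-level image features alone.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def label_attention(V, l, W_v, W_l, w):
    """Compute an attention mask over image regions guided by a label embedding.

    V   : (num_regions, d_v) visual feature vectors, one per image region
    l   : (d_e,) word embedding of the predicted category (the 'label')
    W_v : (d_v, d_h), W_l : (d_e, d_h), w : (d_h,) learned projections (here random)
    """
    # Project regions and the label into a shared space, score each region,
    # then normalize the scores into an attention mask.
    scores = np.tanh(V @ W_v + l @ W_l) @ w   # (num_regions,)
    alpha = softmax(scores)                   # attention mask, sums to 1
    context = alpha @ V                       # (d_v,) attended visual context
    return alpha, context

rng = np.random.default_rng(0)
num_regions, d_v, d_e, d_h = 49, 512, 300, 256
V = rng.standard_normal((num_regions, d_v))
l = rng.standard_normal(d_e)
W_v = rng.standard_normal((d_v, d_h)) * 0.01
W_l = rng.standard_normal((d_e, d_h)) * 0.01
w = rng.standard_normal(d_h) * 0.01

alpha, context = label_attention(V, l, W_v, W_l, w)
print(alpha.shape, context.shape)  # (49,) (512,)
```

In a full decoder the mask would typically also depend on the hidden state at each time step; the sketch isolates only the label-guidance component.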

Highlights

  • Image captioning is a task aiming at generating natural language sentences to comprehensively describe the content of images

  • With the rapid development of remote sensing technology, quantities of remote sensing images can be accessed. This convenience stimulates the demand for semantically understanding remote sensing images and the demand of retrieving specific images in a large number of images

  • In the reported comparisons, “SAT” stands for the “Show-Attend-and-Tell” model; “models (LAM)” denotes models that adopt the Label-Attention Mechanism instead of the conventional attention mechanism; and “models (LAM-TL)” denotes models in which ground-truth labels are used only in the testing phase


Summary

Introduction

Image captioning is a task aiming at generating natural language sentences to comprehensively describe the content of images. With the rapid development of remote sensing technology, large quantities of remote sensing images can be accessed. Consider a specific demand such as searching for “warships in the harbor”: satisfying it requires recognizing both “warship” and “harbor”, as well as the relationship “in” between these two objects. The remote sensing image captioning task can solve this problem well and helps in managing remote sensing data. Moreover, this challenging task plays an important role in many fields, such as scene understanding, image retrieval and military intelligence generation [5]

