Abstract

The task of image captioning involves generating a sentence that appropriately describes an image, and it lies at the intersection of computer vision and natural language processing. Although research on remote sensing image captioning has only recently begun, it is of great significance. The attention mechanism, inspired by the way humans think, is widely used in remote sensing image captioning tasks. However, the attention mechanisms currently used in this task are mainly aimed at the image alone, which is too simple to model such a complex task well. Therefore, in this paper, we propose a multi-level attention model that imitates human attention mechanisms more closely. The model contains three attention structures, representing attention to different areas of the image, attention to different words, and attention to vision versus semantics. Experiments show that our model achieves better results than previous methods and is currently the state of the art. In addition, the existing datasets for remote sensing image captioning contain a large number of errors. Therefore, in this paper, we have done a great deal of work to revise the existing datasets in order to promote research on remote sensing image captioning.
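
The abstract names the three attention structures but does not spell out how they interact. The following minimal sketch (in PyTorch) illustrates one plausible way such a decoding step could be wired together; the class name `MultiLevelAttention`, the layer shapes, and the sigmoid gate between the visual and semantic contexts are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class MultiLevelAttention(nn.Module):
    """Hypothetical sketch of a three-part attention step for captioning.

    At each decoding step it computes (1) attention over image regions,
    (2) attention over previously generated word embeddings, and
    (3) a gate that balances the visual and semantic contexts.
    """

    def __init__(self, feat_dim, word_dim, hidden_dim):
        super().__init__()
        self.vis_att = nn.Linear(feat_dim + hidden_dim, 1)   # scores for image regions
        self.word_att = nn.Linear(word_dim + hidden_dim, 1)  # scores for generated words
        self.vis_proj = nn.Linear(feat_dim, hidden_dim)
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim * 3, 1)              # vision-vs-semantics gate

    def forward(self, regions, words, hidden):
        # regions: (B, R, feat_dim) image region features
        # words:   (B, T, word_dim) embeddings of words generated so far
        # hidden:  (B, hidden_dim)  current decoder state
        B, R, _ = regions.shape
        T = words.shape[1]

        # (1) spatial attention over image regions
        h_r = hidden.unsqueeze(1).expand(B, R, -1)
        alpha = torch.softmax(self.vis_att(torch.cat([regions, h_r], -1)).squeeze(-1), dim=1)
        vis_ctx = self.vis_proj((alpha.unsqueeze(-1) * regions).sum(1))

        # (2) attention over the words generated so far
        h_w = hidden.unsqueeze(1).expand(B, T, -1)
        beta = torch.softmax(self.word_att(torch.cat([words, h_w], -1)).squeeze(-1), dim=1)
        sem_ctx = self.word_proj((beta.unsqueeze(-1) * words).sum(1))

        # (3) gate deciding how much to rely on vision vs. semantics
        g = torch.sigmoid(self.gate(torch.cat([vis_ctx, sem_ctx, hidden], -1)))
        return g * vis_ctx + (1.0 - g) * sem_ctx


# Example: 36 image regions, 5 generated words, batch of 2
att = MultiLevelAttention(feat_dim=2048, word_dim=300, hidden_dim=512)
ctx = att(torch.randn(2, 36, 2048), torch.randn(2, 5, 300), torch.randn(2, 512))
print(ctx.shape)  # torch.Size([2, 512])
```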

Highlights

  • Transforming vision into language is a common activity in human daily life

  • We extensively revise the existing remote sensing image caption datasets and make the modified datasets public

  • Inspired by the attention mechanism widely used in image tasks, we propose a multi-level attention model


Introduction

Transforming vision into language is a common scene in daily life for human beings. For example, when someone asks you, “What are you looking at?”, you might say, “I saw a bird flying over my head.” If we ask a robot, “How many cups are there on the table in front of you?”, it is very difficult for the robot to answer, because it needs to understand our question and find the corresponding answer in the visual information, which is a very challenging task. The task of image captioning studies how to generate a sentence that appropriately describes an image, while VQA studies how to make an intelligent machine answer questions about a video after the machine has watched it. Both tasks study the translation of vision into language.
