VAA: Visual Aligning Attention Model for Remote Sensing Image Captioning

Zhengyuan Zhang, Wenhui Diao, Xin Gao, Wenkai Zhang, Menglong Yan, Xian Sun

doi:10.1109/access.2019.2942154


Open Access


Abstract

Owing to its effectiveness in selectively focusing on regions of interest in images, the attention mechanism has been widely used in the image captioning task, where it can provide more accurate image information for training deep sequential models. Existing attention-based models typically rely on a top-down attention mechanism. While somewhat effective, the attention masks in these models are queried from image features by the hidden states of the LSTM rather than optimized directly by the objective function. This indirectly supervised training approach cannot ensure that the attention layers accurately focus on regions of interest. To address this issue, this paper proposes a novel attention model, the Visual Aligning Attention model (VAA). In this model, the attention layer is optimized by a well-designed visual aligning loss during the training phase. The visual aligning loss is obtained by explicitly calculating the feature similarity between the attended image features and the corresponding word embedding vectors. In addition, to eliminate the influence of non-visual words when training the attention layer, a visual vocabulary is proposed for filtering out non-visual words in sentences, so that these words are ignored when calculating the visual aligning loss. Experiments on UCM-Captions and Sydney-Captions show that the proposed method is more effective for the remote sensing image captioning task.
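
The abstract gives only a verbal description of the visual aligning loss, so the following is a minimal sketch of one way such a loss could be computed, not the authors' formulation. It assumes cosine similarity as the feature-similarity measure, assumes the attended image features have already been projected to the word-embedding dimension, and uses a hypothetical `VISUAL_VOCAB` set and toy two-dimensional vectors purely for illustration.

```python
import math

# Hypothetical visual vocabulary; the paper's actual word list is not given here.
VISUAL_VOCAB = {"airplane", "runway", "building", "trees", "river"}


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def visual_aligning_loss(attended_feats, word_embs, words, visual_vocab=VISUAL_VOCAB):
    """Average (1 - cosine similarity) over time steps whose word is visual.

    attended_feats[t] is the attended image feature at step t, assumed to be
    already projected to the same dimension as word_embs[t] (an assumption).
    """
    terms = [
        1.0 - cosine_similarity(feat, emb)
        for feat, emb, word in zip(attended_feats, word_embs, words)
        if word in visual_vocab  # non-visual words contribute nothing
    ]
    if not terms:
        return 0.0  # caption contains no visual words
    return sum(terms) / len(terms)


# Toy caption: only "airplane" and "runway" are in the visual vocabulary.
words = ["an", "airplane", "on", "the", "runway"]
word_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
attended_feats = [[0.0, 1.0], [0.0, 2.0], [1.0, -1.0], [0.0, 1.0], [0.0, 3.0]]
loss = visual_aligning_loss(attended_feats, word_embs, words)
```

In this toy input the attended feature for "airplane" is parallel to its word embedding and the one for "runway" is orthogonal to its embedding, so the loss averages a well-aligned step and a poorly aligned step. The steps for "an", "on", and "the" are skipped entirely. In a real model a term of this kind would presumably be added to the usual captioning objective, with a weighting that this page does not specify.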

Highlights

Image captioning is a complicated task that bridges both the visual and linguistic domains
It is noteworthy that the proposed Visual Aligning Attention model can obtain the best scores on all the metrics
From Tables 1-2, it is easy to see that the CSMLF method performs worst on UCM-Captions and Sydney-Captions. This indicates that the encoder-decoder framework is effective for remote sensing image captioning and that LSTM is good at generating semantically correct sentences

Summary

Introduction

Image captioning is a complicated task that bridges the visual and linguistic domains. In this task, image captioning models must understand the content of input images in order to generate sentences in human language. Unlike most existing models, which are designed for classification [1]–[4], object detection [5]–[12], and semantic segmentation [13]–[18], image captioning models can satisfy the demand for refined retrieval. With the rapid development of remote sensing technology, high-resolution remote sensing images have become accessible. The growing quantity of remote sensing images makes managing such large volumes of data more difficult. Remote sensing image captioning (RSIC) is meaningful for addressing this problem. What’s more, many applications, such as remote sensing



Journal: IEEE Access
Publication Date: Jan 1, 2019
Citations: 64
License type: CC BY 4.0
