Abstract

In image captioning, learning which image regions to attend to is necessary for adaptively and precisely focusing on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing convolution operations directly on the 2D feature maps. The proposed attention mechanism contains two components, convolutional spatial attention and cross-channel attention, which determine the regions used to describe the image along the spatial and channel dimensions, respectively. Both attentions are computed at each decoding step. To preserve the spatial structure, instead of operating on the vector representation of each image grid, both attention components are computed directly on the entire feature maps with convolution operations. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the outstanding performance of the proposed method.
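The abstract does not spell out the layer configuration, so the following PyTorch sketch is only one plausible reading of the two components: the 3x3 kernel, the pooling, and the way the decoder state h_t is injected are all illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of the two attention components described above.
# All shapes and layer choices here are assumptions for illustration.
import torch
import torch.nn as nn

class ConvSpatialAttention(nn.Module):
    """Spatial attention computed by convolving the full 2D feature maps,
    so neighbouring grid cells are scored together, not in isolation."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.h_proj = nn.Linear(hidden, channels)      # inject decoder state
        self.conv = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feats, h_t):
        # feats: (B, C, H, W) CNN feature maps; h_t: (B, hidden) decoder state
        h = self.h_proj(h_t)[:, :, None, None]         # (B, C, 1, 1)
        scores = self.conv(torch.tanh(feats + h))      # (B, 1, H, W)
        alpha = torch.softmax(scores.flatten(2), dim=-1).view_as(scores)
        return feats * alpha                           # reweight each location

class CrossChannelAttention(nn.Module):
    """Channel attention: one weight per feature map, conditioned on h_t."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.fc = nn.Linear(channels + hidden, channels)

    def forward(self, feats, h_t):
        pooled = feats.mean(dim=(2, 3))                # (B, C) global pooling
        beta = torch.sigmoid(self.fc(torch.cat([pooled, h_t], dim=1)))
        return feats * beta[:, :, None, None]          # reweight each channel
```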

Highlights

  • Image captioning aims to automatically generate a natural-language sentence for a given image [1,2,3,4,5,6]; encoder-decoder frameworks with attention mechanisms have made great progress on this task in recent years

  • Grid-based attention realized by fully connected layers treats the image features as a set of independent vectors, one per region of the image grid; it computes an attention weight for each vector and aggregates the vectors with a weighted sum (see the sketch after this list)

  • We propose a convolutional attention module called Structure Preserving Convolutional Attention (SPCA) that preserves the spatial structure of the image by performing convolutions directly on the 2D feature maps
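As a point of contrast, here is a minimal PyTorch sketch of the conventional grid-based soft attention criticized above; the layer names and sizes are illustrative assumptions, not taken from the paper.

```python
# Conventional grid-based soft attention: the (H, W) layout is flattened
# into H*W independent vectors before any weighting, which is exactly the
# structure loss the paper argues against.
import torch
import torch.nn as nn

class GridAttention(nn.Module):
    def __init__(self, feat_dim, hidden):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, hidden)   # score each grid vector
        self.w_h = nn.Linear(hidden, hidden)     # score the decoder state
        self.w_a = nn.Linear(hidden, 1)

    def forward(self, feats, h_t):
        # feats: (B, C, H, W) -> (B, H*W, C); spatial layout is discarded
        v = feats.flatten(2).transpose(1, 2)
        e = self.w_a(torch.tanh(self.w_v(v) + self.w_h(h_t).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)          # one weight per grid vector
        return (alpha * v).sum(dim=1)            # weighted-sum context (B, C)
```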

Summary

Introduction

Image captioning aims to automatically generate a natural-language sentence for a given image [1,2,3,4,5,6], a task on which encoder-decoder frameworks with attention mechanisms have made great progress in recent years. Grid-based attention realized by fully connected layers treats the image features as a set of independent vectors, one per region of the image grid; it computes an attention weight for each vector and aggregates the vectors with a weighted sum. This operation entirely breaks the spatial structure between grids, which can prevent the model from fully understanding the scene. Our convolutional approach demonstrates effectiveness and generalization ability when applied to two distinct models with both 1D and 2D LSTM latent states.
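To make concrete the claim that both attentions are computed at each decoding step, the following hypothetical step function shows where such attention would sit inside an LSTM decoder; it reuses the attention modules sketched after the abstract, and every name and the pooling choice are assumptions rather than the paper's actual interface.

```python
# Hypothetical single decoding step, reusing ConvSpatialAttention and
# CrossChannelAttention from the earlier sketch. Names are illustrative.
import torch

def decode_step(lstm_cell, spatial_attn, channel_attn, out_proj,
                feats, word_emb, h_t, c_t):
    # Both attentions are re-computed from the current hidden state h_t,
    # so the attended regions can change with every generated word.
    attended = channel_attn(spatial_attn(feats, h_t), h_t)   # (B, C, H, W)
    context = attended.mean(dim=(2, 3))                      # pool to (B, C)
    # Feed the word embedding and the attended context into the LSTM cell.
    h_t, c_t = lstm_cell(torch.cat([word_emb, context], dim=1), (h_t, c_t))
    return out_proj(h_t), h_t, c_t                           # vocab logits
```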

Image Captioning
Attention Mechanism in Captioning
Overview
Structure Preserving Convolutional Attention
Convolutional Spatial Attention
Cross Channel Attention
Dataset and Evaluation
Implementation Details
Attention Structure Selection
Convolution Kernel Size
Performance Comparisons
Conclusions