Abstract

In image captioning, a model must learn which image regions to attend to so that it can focus adaptively and precisely on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing convolution directly on the 2D feature maps. The proposed attention mechanism contains two components, convolutional spatial attention and cross-channel attention, which determine the regions to describe along the spatial and channel dimensions, respectively. Both attentions are calculated at each decoding step. To preserve the spatial structure, both components are computed with convolution operations directly on the entire feature maps rather than on the vector representation of each image grid cell. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the effectiveness of the proposed method.
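The two components described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the way the decoder state is injected (as a per-channel bias), the use of a 1x1 convolution plus spatial averaging for the channel scores, and all names (`conv2d_same`, `conv_spatial_attention`, `cross_channel_attention`, `K_s`, `K_c`, `W_hs`, `W_hc`) are assumptions made for illustration. The only point it tries to show is that both attention maps are computed on the full `(H, W, C)` feature map without flattening it into independent vectors.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def conv2d_same(x, kernel):
    """Stride-1, zero-padded 2D convolution (cross-correlation).

    x: (H, W, C_in); kernel: (k, k, C_in, C_out) with odd k.
    """
    k = kernel.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, kernel.shape[3]))
    for i in range(k):
        for j in range(k):
            # Each kernel tap contributes a shifted view of the padded input.
            out += xp[i:i + H, j:j + W, :] @ kernel[i, j]
    return out

def conv_spatial_attention(fmap, hidden, K_s, W_hs):
    """Return an (H, W) attention map that sums to 1.

    Assumption: the decoder state is projected to C channels and added
    to every spatial position before the convolution.
    """
    x = np.tanh(fmap + hidden @ W_hs)        # (H, W, C)
    scores = conv2d_same(x, K_s)[..., 0]     # (H, W), neighbours included
    return softmax(scores.ravel()).reshape(scores.shape)

def cross_channel_attention(fmap, hidden, K_c, W_hc):
    """Return a (C,) channel weighting that sums to 1.

    Assumption: a 1x1 convolution mixes channels at each position and the
    result is averaged over positions to give one score per channel.
    """
    x = np.tanh(fmap + hidden @ W_hc)        # (H, W, C)
    mixed = conv2d_same(x, K_c)              # (H, W, C)
    return softmax(mixed.mean(axis=(0, 1)))

# Toy sizes and random weights, purely hypothetical.
rng = np.random.default_rng(0)
H, W, C, D, k = 4, 4, 8, 6, 3
fmap = rng.normal(size=(H, W, C))            # encoder feature map
hidden = rng.normal(size=D)                  # decoder state at one step
K_s = rng.normal(size=(k, k, C, 1)) * 0.1
K_c = rng.normal(size=(1, 1, C, C)) * 0.1
W_hs = rng.normal(size=(D, C)) * 0.1
W_hc = rng.normal(size=(D, C)) * 0.1

alpha = conv_spatial_attention(fmap, hidden, K_s, W_hs)
beta = cross_channel_attention(fmap, hidden, K_c, W_hc)
# Context vector for this decoding step: weight positions and channels.
context = (fmap * alpha[..., None] * beta).sum(axis=(0, 1))
```

Because the spatial scores come from a `k x k` kernel, each position's weight depends on its neighbours, which is the sense in which the 2D layout is kept. In a captioning model this computation would be repeated at every decoding step with the current decoder state.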

Highlights

  • Image captioning is the task of automatically generating a natural language sentence for a given image [1,2,3,4,5,6]; encoder-decoder frameworks with attention mechanisms have made great progress on it in recent years

  • Grid-based attention, realized with a fully connected layer, treats the image features as a set of independent vectors, each corresponding to one cell of the image grid; it calculates an attention weight for each vector and aggregates the vectors with a weighted sum

  • We propose a convolutional attention module called Structure Preserving Convolutional Attention (SPCA) that preserves the spatial structure of the image by applying convolution operations directly on the 2D feature maps

Summary

Introduction

Image captioning is the task of automatically generating a natural language sentence for a given image [1,2,3,4,5,6]; encoder-decoder frameworks with attention mechanisms have made great progress on it in recent years. Grid-based attention, realized with a fully connected layer, treats the image features as a set of independent vectors, each corresponding to one cell of the image grid. It calculates an attention weight for each vector and aggregates the vectors with a weighted sum. This operation discards the spatial structure among grid cells, which can hinder the model's understanding of the scene. Our convolutional approach demonstrates generalization ability when applied to two distinct models with 1D and 2D LSTM latent states.
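To make the point about lost spatial structure concrete, here is a minimal NumPy sketch of grid-based attention in its common additive form. The scoring network, the dimensions, and the names (`grid_attention`, `W_f`, `W_h`, `w`) are assumptions for illustration and are not taken from the paper. The sketch shows that shuffling the grid cells changes nothing in the resulting context vector, so the layer cannot tell which cells are neighbours.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def grid_attention(features, hidden, W_f, W_h, w):
    """Additive attention over independent grid vectors.

    features: (N, C), one vector per grid cell; hidden: (D,) decoder state.
    Returns the weights (N,) and the weighted-sum context (C,).
    """
    # Each grid vector is scored on its own; neighbours play no role.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ w   # (N,)
    alpha = softmax(scores)
    context = alpha @ features                             # (C,)
    return alpha, context

# Toy sizes and random weights, purely hypothetical.
rng = np.random.default_rng(0)
H, W, C, D, A = 3, 3, 8, 6, 5
fmap = rng.normal(size=(H, W, C))
hidden = rng.normal(size=D)
W_f = rng.normal(size=(C, A))
W_h = rng.normal(size=(D, A))
w = rng.normal(size=A)

# Flattening the 2D map into N = H * W vectors is where layout is dropped.
features = fmap.reshape(H * W, C)
alpha, context = grid_attention(features, hidden, W_f, W_h, w)

# Shuffle the grid cells: the weights are shuffled the same way and the
# context vector is unchanged.
perm = rng.permutation(H * W)
alpha_p, context_p = grid_attention(features[perm], hidden, W_f, W_h, w)
same_context = np.allclose(context, context_p)
```

A convolution with a kernel larger than 1x1 does not have this invariance, since its output at a position depends on which cells surround it.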

Image Captioning
Attention Mechanism in Captioning
Overview
Structure Preserving Convolutional Attention
Convolutional Spatial Attention
Cross Channel Attention
Dataset and Evaluation
Implementation Details
Attention Structure Selection
C: Cross Channel Attention
Convolution Kernel Size
Performance Comparisons
Conclusions
