Abstract
The self-attention mechanism, which has been successfully applied to the current encoder-decoder framework for image captioning, is used to enhance feature representations in the image encoder and to capture the most relevant information for the language decoder. However, most existing methods assign attention weights to all candidate vectors, implicitly assuming that all of them are relevant. Moreover, current self-attention mechanisms only consider inter-object relationships and ignore the intra-object attention distribution. In this paper, we propose a Multi-Gate Attention (MGA) block, which extends traditional self-attention with an additional Attention Weight Gate (AWG) module and a Self-Gated (SG) module. The former constrains the attention weights so that they are assigned to the most contributive objects. The latter models the intra-object attention distribution and eliminates irrelevant information within each object feature vector. Furthermore, most current image captioning methods directly apply the original transformer, designed for natural language processing tasks, to refine image features. We therefore propose a pre-layernorm transformer that simplifies the transformer architecture and makes it more efficient for image feature enhancement. By integrating the MGA block with the pre-layernorm transformer architecture into the image encoder and the AWG module into the language decoder, we present a novel Multi-Gate Attention Network (MGAN). Experiments on the MS COCO dataset indicate that MGAN outperforms most state-of-the-art methods, and further experiments that combine MGA blocks with other methods demonstrate the generalizability of our proposal.
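As a rough illustration of the pre-layernorm ordering mentioned above, the following PyTorch sketch applies layer normalization before each sublayer rather than after it. This is a minimal sketch of the general pre-layernorm idea only; the layer sizes, module names, and feed-forward design are illustrative assumptions and are not taken from the paper.

import torch
import torch.nn as nn

class PreLNEncoderLayerSketch(nn.Module):
    """Pre-layernorm encoder layer: normalize before each sublayer, then add the residual."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        # Pre-LN self-attention sublayer: LayerNorm -> attention -> residual add.
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Pre-LN feed-forward sublayer: LayerNorm -> FFN -> residual add.
        x = x + self.ffn(self.norm2(x))
        return x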
Highlights
Image captioning is the challenging task of automatically generating a fluent and reasonable sentence that describes the visual content of an image
For a given image, a set of feature vectors is first encoded by a Convolutional Neural Network (CNN), and a caption is then decoded from these vectors by a Recurrent Neural Network (RNN)
We propose a Multi-Gate Attention (MGA) block, which contains an Attention Weight Gate (AWG) module and a Self-Gated (SG) module to constrain the attention weights and to model the intra-object attention distribution, respectively
Summary
Image captioning is the challenging task of automatically generating a fluent and reasonable sentence that describes the visual content of an image. To tackle the problems mentioned above, in this paper we propose a Multi-Gate Attention (MGA) block with a pre-layernorm transformer architecture for image captioning. It extends vanilla self-attention by modifying the architecture and adding multiple gate mechanisms. After the SG module, a self-attention module is applied to model the relationships among all input feature vectors: the similarity scores between query and key vectors are calculated first, and these scores are then passed through a softmax layer to generate the attention weights. By applying the MGA block with the pre-layernorm transformer architecture to the image encoder and the AWG module to the language decoder, a Multi-Gate Attention Network (MGAN) is obtained. We describe how to construct MGAN for image captioning.
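The attention flow described above can be sketched roughly as follows (PyTorch assumed): an SG-style gate filters each object feature vector, scaled query-key similarity scores are turned into attention weights by a softmax, and an AWG-style gate re-weights that attention distribution. The sigmoid gate formulations, projection layers, and class name here are placeholder assumptions for illustration only; the paper's exact gating functions are not given in this summary.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MGABlockSketch(nn.Module):
    """Illustrative single-head sketch of the SG -> self-attention -> AWG flow (assumed forms)."""
    def __init__(self, d_model=512):
        super().__init__()
        self.sg_gate = nn.Linear(d_model, d_model)   # Self-Gated (SG) module, assumed sigmoid gate
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.awg_gate = nn.Linear(d_model, 1)        # Attention Weight Gate (AWG), assumed sigmoid gate

    def forward(self, x):
        # x: (batch, num_objects, d_model) object feature vectors.
        # SG: element-wise gate over each object feature vector (intra-object filtering).
        x = x * torch.sigmoid(self.sg_gate(x))
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Similarity scores between query and key vectors, then softmax -> attention weights.
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        weights = F.softmax(scores, dim=-1)
        # AWG: re-weight the attention distribution toward the most contributive objects.
        weights = weights * torch.sigmoid(self.awg_gate(x)).transpose(-2, -1)
        return weights @ v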