Abstract

Image captioning aims to generate a corresponding description for an image. In recent years, neural encoder-decoder models have been the dominant approach, in which a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) network are used to translate an image into a natural language description. Among these approaches, visual attention mechanisms are widely used to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. However, most conventional visual attention mechanisms are based on high-level image features, ignoring the effects of other image features and giving insufficient consideration to the relative positions between image features. In this work, we propose a Position-Aware Transformer model with image-feature attention and position-aware attention mechanisms to address these problems. The image-feature attention first extracts multi-level features using a Feature Pyramid Network (FPN) and then fuses these features with scaled dot-product attention, which enables our model to detect objects of different scales in the image more effectively without increasing the number of parameters. The position-aware attention mechanism first obtains the relative positions between image features and then incorporates them into the original image features so that captions are generated more accurately. Experiments on the MSCOCO dataset show that our approach achieves competitive BLEU-4, METEOR, ROUGE-L, and CIDEr scores compared with state-of-the-art approaches, demonstrating its effectiveness.
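
As a rough illustration of the fusion step described in the abstract, the following is a minimal sketch, not the authors' implementation, of fusing multi-level FPN feature maps with plain scaled dot-product attention. The function name `fuse_pyramid_levels`, the feature shapes, and the choice of the finest level as the query are assumptions made for this example.

```python
import torch

def fuse_pyramid_levels(levels):
    """Fuse multi-level FPN features with scaled dot-product attention.

    levels: list of tensors, each of shape (batch, channels, H_i, W_i),
            e.g. pyramid outputs sharing a common channel width.
    Returns a fused tensor of shape (batch, N_q, channels), where N_q is
    the number of spatial positions in the finest level.
    """
    b, c = levels[0].shape[:2]
    # Flatten each level's spatial grid into a sequence of feature vectors.
    seqs = [lvl.flatten(2).transpose(1, 2) for lvl in levels]   # (b, H_i*W_i, c)
    q = seqs[0]                      # queries: the finest pyramid level
    kv = torch.cat(seqs, dim=1)      # keys/values: all levels concatenated
    # Scaled dot-product fusion; note there are no extra learned parameters.
    attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)
    return attn @ kv                 # (b, N_q, c)

# Example with dummy pyramid outputs at three spatial scales.
p3 = torch.randn(2, 256, 28, 28)
p4 = torch.randn(2, 256, 14, 14)
p5 = torch.randn(2, 256, 7, 7)
fused = fuse_pyramid_levels([p3, p4, p5])
print(fused.shape)  # torch.Size([2, 784, 256])
```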

Highlights

  • The work of Vaswani et al. [16] shows that the Transformer has excellent performance on machine translation and other sequence-to-sequence problems; it is based on the self-attention mechanism (recalled in the formula after this list) and enables models to be trained in parallel by excluding recurrent structures

  • To solve the above problems, Xu et al. [12] introduced the attention mechanism for image captioning, which guides the model to attend to different salient regions of the image dynamically at each step, instead of feeding all image features to the decoder at the initial step

  • To investigate the performance improvements brought by the proposed sub-modules, we report SPICE F-scores over various subcategories on the MSCOCO test set in Tab. 3 and Fig. 8
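
For reference, the scaled dot-product self-attention underlying the Transformer in the first highlight is the standard formulation from Vaswani et al. [16], not notation specific to this paper:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

where Q, K, and V are the query, key, and value matrices and d_k is the key dimension; because this computation contains no recurrence, all positions can be processed in parallel during training.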


Summary

Introduction

Image captioning [1] aims to describe the visual contents of an image in natural language; it is a sequence-to-sequence problem and can be viewed as translating an image into its corresponding description. The work of Vaswani et al. [16] shows that the Transformer has excellent performance on machine translation and other sequence-to-sequence problems: it is based on the self-attention mechanism and enables models to be trained in parallel by excluding recurrent structures. In order to obtain captions of superior quality, a Position-Aware Transformer model for image captioning is proposed. The contributions of this model are as follows: (1) to enable the model to detect objects of different scales in the image without increasing the number of parameters, the image-feature attention is proposed, which uses scaled dot-product attention to fuse multi-level features within an image feature pyramid; (2) to generate more human-like captions, the position-aware attention is proposed to learn relative positions between image features, so that the features can be explained from the perspective of spatial relationships.
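
The paper's exact position-aware formulation is not reproduced on this page; the sketch below illustrates one common way to inject relative positions into attention, namely adding a learned bias indexed by the relative offset between image-feature positions to the attention logits. The module name, the scalar-bias parameterization, and the 1-D offset indexing are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class RelativePositionAttention(nn.Module):
    """Scaled dot-product attention with a learned relative-position bias.

    Each pair of feature positions (i, j) contributes a bias b[i - j] to the
    attention logit, so attention weights depend on relative position as
    well as content.
    """
    def __init__(self, dim, max_len):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # One learned scalar per relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, n, dim) flattened image features, with n <= max_len.
        n = x.size(1)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        logits = q @ k.transpose(1, 2) * self.scale          # (batch, n, n)
        # Relative offsets i - j, shifted to index into rel_bias.
        idx = torch.arange(n, device=x.device)
        rel = idx[:, None] - idx[None, :] + (n - 1)          # values in [0, 2n-2]
        logits = logits + self.rel_bias[rel]                 # broadcast over batch
        attn = torch.softmax(logits, dim=-1)
        return attn @ v

# Example: 49 flattened grid features of dimension 512.
feats = torch.randn(2, 49, 512)
layer = RelativePositionAttention(dim=512, max_len=49)
out = layer(feats)
print(out.shape)  # torch.Size([2, 49, 512])
```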

Image Captioning and Attention Mechanism
Transformer and Self-Attention Mechanism
Relative Position Information
The Proposed Approach
Image-Feature Attention for Feature Fusion
Position-Aware Attention
Metrics
Loss Functions
Dataset
Data Preprocessing
Inference
Implementation Details
Ablation Studies
Comparing with Other State-of-the-Art Methods
Conclusion and Future Work
