Abstract

Image captioning is a challenging task, and it is important for machines to better understand the meaning of images. In recent years, image captioning models have usually used long short-term memory (LSTM) networks as the decoder to generate sentences, and these models show excellent performance. Although an LSTM can memorize dependencies, its structure is complicated and inherently sequential across time. To address these issues, recent work has shown the benefits of the Transformer for machine translation. Inspired by that success, we develop a Captioning Transformer (CT) model with stacked attention modules, introducing the Transformer to the image captioning task. The CT model contains only attention modules, with no dependency across time steps: it can memorize long-range dependencies within a sequence and can also be trained in parallel. Moreover, we propose multi-level supervision to help the Transformer achieve better performance. Extensive experiments are carried out on the challenging MSCOCO dataset, and the proposed Captioning Transformer achieves competitive performance compared with state-of-the-art methods.
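The abstract names multi-level supervision without detailing the mechanism. One plausible reading, offered here only as an assumption, is to attach the caption cross-entropy loss to every decoder layer's output rather than only the last one, then average the per-layer losses. The sketch below illustrates that idea in PyTorch; the function name multi_level_loss, the shared vocab_proj head, and the per-layer output list are all hypothetical names, not the paper's.

    import torch
    import torch.nn as nn

    def multi_level_loss(layer_outputs, targets, vocab_proj, pad_idx=0):
        """Hypothetical multi-level supervision: apply the caption
        cross-entropy loss to every decoder layer's hidden states,
        not just the final layer, and average the per-layer losses.

        layer_outputs: list of (batch, seq_len, d_model) tensors,
                       one per decoder layer (an assumption of this sketch).
        targets:       (batch, seq_len) ground-truth word indices.
        vocab_proj:    shared nn.Linear(d_model, vocab_size) output head.
        """
        criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
        losses = []
        for h in layer_outputs:
            logits = vocab_proj(h)                      # (batch, seq, vocab)
            losses.append(criterion(
                logits.reshape(-1, logits.size(-1)),    # (batch*seq, vocab)
                targets.reshape(-1)))                   # (batch*seq,)
        return torch.stack(losses).mean()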

Highlights

  • The target of image captioning is to describe the content of images automatically

  • Our key contributions are presented as follows: (a) a Captioning Transformer model that shows comparable performance to an LSTM-based method on standard metrics; (b) a multi-level supervision training method, proposed to better train the Transformer and improve its performance; (c) an evaluation of our architecture on the challenging MSCOCO dataset, comparing it with LSTM-based models

  • The Captioning Transformer (CT)-C3-64a4n6 uses the third combination method, in which the image embedding is fed to the Transformer as the input in the Neural Image Caption (NIC) style, and the spatial image feature matrices are used as the input of the second sub-layer of the Transformer (see the sketch after this list)
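As a rough illustration of this third combination, the decoder layer below runs masked self-attention over the caption sequence (prefixed, NIC-style, with the global image embedding) and then attends over the spatial feature matrix in its second sub-layer. This is a minimal PyTorch sketch under those assumptions; names such as C3DecoderLayer and spatial_feats are ours, not the paper's.

    import torch
    import torch.nn as nn

    class C3DecoderLayer(nn.Module):
        """Sketch of the third combination (C3): the global image
        embedding enters the decoder like a word (NIC-style), while
        the spatial feature matrix serves as the key/value memory of
        the second (cross-attention) sub-layer. Sizes are illustrative."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                nn.Linear(4 * d_model, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.norm3 = nn.LayerNorm(d_model)

        def forward(self, x, spatial_feats, causal_mask=None):
            # First sub-layer: masked self-attention over the caption
            # sequence (which starts with the global image embedding).
            h, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
            x = self.norm1(x + h)
            # Second sub-layer: attend over the spatial image features,
            # e.g. a 14x14 CNN grid flattened to (batch, 196, d_model).
            h, _ = self.cross_attn(x, spatial_feats, spatial_feats)
            x = self.norm2(x + h)
            return self.norm3(x + self.ffn(x))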


Summary

Introduction

The target of image captioning is to describe the content of images automatically. An image may contain various objects, and these objects may have complex relations. An RNN has a complex addressing and overwriting mechanism combined with an inherently sequential processing problem: to generate the current hidden state h_t, it needs the previous hidden state h_{t-1} as input. This mechanism is designed to model relationships across time steps, but it prevents the sequence from being trained in parallel. The Transformer [3] model contains stacked attention mechanisms and eschews recurrence; this mechanism can draw global dependencies between input and output. Our key contributions are presented as follows: (a) a Captioning Transformer model that shows comparable performance to an LSTM-based method on standard metrics; (b) a multi-level supervision training method, proposed to better train the Transformer and improve its performance.
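To make the contrast with the RNN recurrence concrete, here is a minimal sketch of the scaled dot-product attention at the core of the Transformer: the outputs for all positions come from a single batched matrix product, with no h_{t-1} -> h_t dependency, so training can proceed in parallel across the sequence.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V, mask=None):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
        All positions are processed in one matrix product, so unlike
        an RNN there is no dependence on a previous hidden state."""
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        return torch.matmul(F.softmax(scores, dim=-1), V)

    # Example: a 10-step sequence with 64-dim keys; one call covers all steps.
    Q = K = V = torch.randn(2, 10, 64)            # (batch, seq_len, d_k)
    out = scaled_dot_product_attention(Q, K, V)   # (2, 10, 64)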

Image Captioning
Transformer
Model Architecture
Encoder
Decoder
Image Embedding
Text Embedding
Image Combination
Scaled Dot-Product Attention
Multi-Head Attention
Multi-Level Supervision
Training
Inference
Experiments
Results
Methods
Quantitative Analysis
Conclusions
