Abstract

Image captioning is a challenging task, and it is important for machines to better understand the meaning of images. In recent years, image captioning models have usually used long short-term memory (LSTM) networks as the decoder to generate sentences, and these models show excellent performance. Although an LSTM can memorize dependencies, its structure is complicated and inherently sequential across time. To address these issues, recent work has shown the benefits of the Transformer for machine translation. Inspired by that success, we develop a Captioning Transformer (CT) model with stacked attention modules, introducing the Transformer to the image captioning task. The CT model contains only attention modules, with no dependency across time steps: it can memorize long-range dependencies within a sequence and can also be trained in parallel. Moreover, we propose multi-level supervision to help the Transformer achieve better performance. Extensive experiments are carried out on the challenging MSCOCO dataset, and the proposed Captioning Transformer achieves competitive performance compared with state-of-the-art methods.
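The abstract names multi-level supervision without detailing the mechanism. One plausible reading, offered here only as an assumption, is to attach the caption cross-entropy loss to every decoder layer's output rather than only the last one, then average the per-layer losses. The sketch below illustrates that idea in PyTorch; the function name multi_level_loss, the shared vocab_proj head, and the per-layer output list are all hypothetical names, not the paper's.

    import torch
    import torch.nn as nn

    def multi_level_loss(layer_outputs, targets, vocab_proj, pad_idx=0):
        """Hypothetical multi-level supervision: apply the caption
        cross-entropy loss to every decoder layer's hidden states,
        not just the final layer, and average the per-layer losses.

        layer_outputs: list of (batch, seq_len, d_model) tensors,
                       one per decoder layer (an assumption of this sketch).
        targets:       (batch, seq_len) ground-truth word indices.
        vocab_proj:    shared nn.Linear(d_model, vocab_size) output head.
        """
        criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
        losses = []
        for h in layer_outputs:
            logits = vocab_proj(h)                      # (batch, seq, vocab)
            losses.append(criterion(
                logits.reshape(-1, logits.size(-1)),    # (batch*seq, vocab)
                targets.reshape(-1)))                   # (batch*seq,)
        return torch.stack(losses).mean()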

Highlights

  • The target of image captioning is to describe the content of images automatically

  • Our key contributions are presented as follows: (a) a Captioning Transformer model that shows comparable performance to an LSTM-based method on standard metrics; (b) a multi-level supervision training method, proposed to better train the Transformer and improve its performance; (c) an evaluation of our architecture on the challenging MSCOCO dataset, comparing it with LSTM-based models

  • The Captioning Transformer (CT)-C3-64a4n6 uses the third combination method, in which the image embedding is fed to the Transformer as the input in the Neural Image Caption (NIC) style, and the spatial image feature matrices are used as the input of the second sub-layer of the Transformer (see the sketch after this list)
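As a rough illustration of this third combination, the decoder layer below runs masked self-attention over the caption sequence (prefixed, NIC-style, with the global image embedding) and then attends over the spatial feature matrix in its second sub-layer. This is a minimal PyTorch sketch under those assumptions; names such as C3DecoderLayer and spatial_feats are ours, not the paper's.

    import torch
    import torch.nn as nn

    class C3DecoderLayer(nn.Module):
        """Sketch of the third combination (C3): the global image
        embedding enters the decoder like a word (NIC-style), while
        the spatial feature matrix serves as the key/value memory of
        the second (cross-attention) sub-layer. Sizes are illustrative."""
        def __init__(self, d_model=512, n_heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                nn.Linear(4 * d_model, d_model))
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)
            self.norm3 = nn.LayerNorm(d_model)

        def forward(self, x, spatial_feats, causal_mask=None):
            # First sub-layer: masked self-attention over the caption
            # sequence (which starts with the global image embedding).
            h, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
            x = self.norm1(x + h)
            # Second sub-layer: attend over the spatial image features,
            # e.g. a 14x14 CNN grid flattened to (batch, 196, d_model).
            h, _ = self.cross_attn(x, spatial_feats, spatial_feats)
            x = self.norm2(x + h)
            return self.norm3(x + self.ffn(x))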


Summary

Introduction

The target of image captioning is to describe the content of images automatically. An image may contain various objects, and these objects may have complex relations. An RNN has a complex addressing and overwriting mechanism combined with an inherently sequential processing problem: to generate the current hidden state h_t, it needs the previous hidden state h_{t-1} as input. This mechanism is designed to model relationships across time steps, but it prevents the sequence from being trained in parallel. The Transformer [3] model contains stacked attention mechanisms and eschews recurrence; this mechanism can draw global dependencies between input and output. Our key contributions are presented as follows: (a) a Captioning Transformer model that shows comparable performance to an LSTM-based method on standard metrics; (b) a multi-level supervision training method, proposed to better train the Transformer and improve its performance.
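To make the contrast with the RNN recurrence concrete, here is a minimal sketch of the scaled dot-product attention at the core of the Transformer: the outputs for all positions come from a single batched matrix product, with no h_{t-1} -> h_t dependency, so training can proceed in parallel across the sequence.

    import torch
    import torch.nn.functional as F

    def scaled_dot_product_attention(Q, K, V, mask=None):
        """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.
        All positions are processed in one matrix product, so unlike
        an RNN there is no dependence on a previous hidden state."""
        d_k = Q.size(-1)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        return torch.matmul(F.softmax(scores, dim=-1), V)

    # Example: a 10-step sequence with 64-dim keys; one call covers all steps.
    Q = K = V = torch.randn(2, 10, 64)            # (batch, seq_len, d_k)
    out = scaled_dot_product_attention(Q, K, V)   # (2, 10, 64)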

Image Captioning
Transformer
Model Architecture
Encoder
Decoder
Image Embedding
Text Embedding
Image Combination
Scaled Dot-Product Attention
Multi-Head Attention
Multi-Level Supervision
Training
Inference
Experiments
Results
Methods
Quantitative Analysis
Conclusions
