Abstract

In the quest to make deep learning systems more capable, a number of more complex, more computationally expensive and memory-intensive algorithms have been proposed. This shift glosses over the capability of many simpler systems, or modules within them, to adequately address current and future problems, and has left some deep learning research inaccessible to researchers who do not possess top-of-the-line hardware. The use of simple feed-forward networks has not been explicitly explored in the current transformer-based vision-language field. In this paper, we use a series of feed-forward layers to encode image features and caption embeddings, alleviating some of the computational complexity that accompanies the self-attention mechanism and limits its application to long-sequence tasks. We demonstrate that a decoder does not require masking for conditional short-sequence generation when the task depends not only on the previously generated sequence but also on another input, such as image features. We perform an empirical and qualitative analysis of the use of linear transforms in place of self-attention layers in vision-language models, and obtain competitive results on the MSCOCO dataset. Our best feed-forward model obtains average scores above 90% of those of the current state-of-the-art pre-trained Oscar model on the conventional image captioning metrics. We also demonstrate that the proposed models train faster and use less memory at larger batch sizes and longer sequence lengths.
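To make the general idea concrete, the following is a minimal PyTorch sketch of an encoder block that mixes information across the sequence with plain linear transforms instead of scaled dot-product self-attention. The module name, dimensions, and fixed sequence length are illustrative assumptions for this sketch, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class FeedForwardMixerBlock(nn.Module):
    """Encoder block that replaces self-attention with linear token mixing.

    Hypothetical sketch: seq_len, d_model and d_hidden are assumed values,
    not taken from the paper; the real model may differ.
    """
    def __init__(self, seq_len: int, d_model: int, d_hidden: int = 2048):
        super().__init__()
        # Mix information across sequence positions with a single linear map,
        # avoiding the attention-score computation entirely.
        self.token_mix = nn.Linear(seq_len, seq_len)
        self.norm1 = nn.LayerNorm(d_model)
        # Standard position-wise feed-forward sub-layer, as in a transformer.
        self.channel_mix = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Token mixing acts over the sequence dimension, so transpose first.
        x = x + self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing acts over the feature dimension.
        x = x + self.channel_mix(self.norm2(x))
        return x

# Example: encode 36 region features of size 512 (illustrative numbers).
features = torch.randn(8, 36, 512)
block = FeedForwardMixerBlock(seq_len=36, d_model=512)
print(block(features).shape)  # torch.Size([8, 36, 512])
```

One consequence of this substitution is that the token-mixing layer is tied to a fixed sequence length, which is why it suits short, fixed-size inputs such as a set of image region features or short captions.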

Highlights

  • By allowing for the hierarchical representation of features, with complex features in deeper layers built from successively simpler ones, i.e., multiple levels of abstraction, deep learning algorithms have led to many breakthroughs in representation-learning-dependent tasks

  • The recent popularity of deep learning algorithms can be attributed to their success in the image recognition field, where AlexNet [28], which uses deep convolutional neural networks, achieved a top-5 error rate of 15.3%, 10.8 percentage points lower than the second-ranked model, in the ImageNet Large Scale Visual Recognition Challenge [17]

  • We presented simple yet effective transformer-based image captioning models and performed detailed analyses of the use of feed-forward layers to both encode the images and text captions

Summary

Introduction

By allowing for the hierarchical representation of features, with complex features in deeper layers built from successively simpler ones, i.e., multiple levels of abstraction, deep learning algorithms have led to many breakthroughs in representation-learning-dependent tasks. The recent popularity of deep learning algorithms can be attributed to their success in the image recognition field, where AlexNet [28], which uses deep convolutional neural networks, achieved a top-5 error rate of 15.3%, 10.8 percentage points lower than the second-ranked model, in the ImageNet Large Scale Visual Recognition Challenge [17]. Because image captioning involves both images and text, the two most common deep learning components involved are a CNN to encode the image features into a fixed vector representation, and an RNN to learn to generate the text captions. This is further discussed in the subsequent sections.
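As a rough illustration of that conventional CNN-encoder/RNN-decoder pipeline (not the feed-forward model proposed in this paper), the sketch below pairs a ResNet image encoder with an LSTM caption decoder; the vocabulary size, embedding and hidden dimensions, and the choice of ResNet-50 are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNRNNCaptioner(nn.Module):
    """Conventional CNN-encoder / RNN-decoder captioning baseline (sketch).

    The hyperparameters (vocab_size, embed_dim, hidden_dim) are illustrative.
    """
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # CNN encoder: a ResNet with its classifier head removed, producing
        # a fixed-length vector representation of the image.
        resnet = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.img_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        # RNN decoder: an LSTM that generates the caption token by token.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Encode each image into a single feature vector.
        feats = self.encoder(images).flatten(1)        # (batch, 2048)
        feats = self.img_proj(feats).unsqueeze(1)      # (batch, 1, embed_dim)
        # Prepend the image feature to the caption embeddings (teacher forcing).
        tokens = self.embed(captions)                  # (batch, T, embed_dim)
        hidden, _ = self.lstm(torch.cat([feats, tokens], dim=1))
        return self.out(hidden)                        # (batch, T+1, vocab_size)
```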
