Abstract

In the quest to make deep learning systems more capable, a number of more complex, more computationally expensive and memory-intensive algorithms have been proposed. This shift glosses over the capability of many simpler systems, or modules within them, to adequately address current and future problems, and has left some deep learning research inaccessible to researchers who do not possess top-of-the-line hardware. The use of simple feed-forward networks has not been explicitly explored in the current transformer-based vision-language field. In this paper, we use a series of feed-forward layers to encode image features and caption embeddings, alleviating some of the computational complexity that accompanies the self-attention mechanism and limits its application to long-sequence tasks. We demonstrate that a decoder does not require masking for conditional short-sequence generation when the task depends not only on the previously generated sequence but also on another input, such as image features. We perform an empirical and qualitative analysis of the use of linear transforms in place of self-attention layers in vision-language models, and obtain competitive results on the MSCOCO dataset. Our best feed-forward model obtains average scores above 90% of those of the current state-of-the-art pre-trained Oscar model on the conventional image captioning metrics. We also demonstrate that the proposed models train faster and use less memory at larger batch sizes and longer sequence lengths.
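To make the general idea concrete, the following is a minimal PyTorch sketch of an encoder block that mixes information across the sequence with plain linear transforms instead of scaled dot-product self-attention. The module name, dimensions, and fixed sequence length are illustrative assumptions for this sketch, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn

class FeedForwardMixerBlock(nn.Module):
    """Encoder block that replaces self-attention with linear token mixing.

    Hypothetical sketch: seq_len, d_model and d_hidden are assumed values,
    not taken from the paper; the real model may differ.
    """
    def __init__(self, seq_len: int, d_model: int, d_hidden: int = 2048):
        super().__init__()
        # Mix information across sequence positions with a single linear map,
        # avoiding the attention-score computation entirely.
        self.token_mix = nn.Linear(seq_len, seq_len)
        self.norm1 = nn.LayerNorm(d_model)
        # Standard position-wise feed-forward sub-layer, as in a transformer.
        self.channel_mix = nn.Sequential(
            nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # Token mixing acts over the sequence dimension, so transpose first.
        x = x + self.token_mix(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        # Channel mixing acts over the feature dimension.
        x = x + self.channel_mix(self.norm2(x))
        return x

# Example: encode 36 region features of size 512 (illustrative numbers).
features = torch.randn(8, 36, 512)
block = FeedForwardMixerBlock(seq_len=36, d_model=512)
print(block(features).shape)  # torch.Size([8, 36, 512])
```

One consequence of this substitution is that the token-mixing layer is tied to a fixed sequence length, which is why it suits short, fixed-size inputs such as a set of image region features or short captions.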

Highlights

  • By allowing for the hierarchical representation of features, with complex features in deeper layers built from successively simpler ones, i.e., multiple levels of abstraction, deep learning algorithms have led to many breakthroughs in representation-learning-dependent tasks

  • The recent popularity of deep learning algorithms can be attributed to their success in the image recognition field, where AlexNet [28], which uses deep convolutional neural networks, achieved a top-5 error rate of 15.3%, 10.8 percentage points lower than the second-ranked model, in the ImageNet Large Scale Visual Recognition Challenge [17]

  • We presented simple yet effective transformer-based image captioning models and performed detailed analyses of the use of feed-forward layers to both encode the images and text captions

Summary

Introduction

By allowing for the hierarchical representation of features, with complex features in deeper layers built from successively simpler ones, i.e., multiple levels of abstraction, deep learning algorithms have led to many breakthroughs in representation-learning-dependent tasks. The recent popularity of deep learning algorithms can be attributed to their success in the image recognition field, where AlexNet [28], which uses deep convolutional neural networks, achieved a top-5 error rate of 15.3%, 10.8 percentage points lower than the second-ranked model, in the ImageNet Large Scale Visual Recognition Challenge [17]. Because image captioning involves both images and text, the two most common deep learning components involved are a CNN to encode the image features into a fixed vector representation, and an RNN to learn to generate the text captions. This is further discussed in the subsequent sections.
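As a rough illustration of that conventional CNN-encoder/RNN-decoder pipeline (not the feed-forward model proposed in this paper), the sketch below pairs a ResNet image encoder with an LSTM caption decoder; the vocabulary size, embedding and hidden dimensions, and the choice of ResNet-50 are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNRNNCaptioner(nn.Module):
    """Conventional CNN-encoder / RNN-decoder captioning baseline (sketch).

    The hyperparameters (vocab_size, embed_dim, hidden_dim) are illustrative.
    """
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 512):
        super().__init__()
        # CNN encoder: a ResNet with its classifier head removed, producing
        # a fixed-length vector representation of the image.
        resnet = models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.img_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        # RNN decoder: an LSTM that generates the caption token by token.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Encode each image into a single feature vector.
        feats = self.encoder(images).flatten(1)        # (batch, 2048)
        feats = self.img_proj(feats).unsqueeze(1)      # (batch, 1, embed_dim)
        # Prepend the image feature to the caption embeddings (teacher forcing).
        tokens = self.embed(captions)                  # (batch, T, embed_dim)
        hidden, _ = self.lstm(torch.cat([feats, tokens], dim=1))
        return self.out(hidden)                        # (batch, T+1, vocab_size)
```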
