Abstract

Image captioning is an important task both for improving human-computer interaction and for gaining a deeper understanding of the mechanisms underlying image description by humans. In recent years, this research field has developed rapidly and a number of impressive results have been achieved. Typical models are based on neural networks, including convolutional networks for encoding images and recurrent networks for decoding them into text. In addition, attention mechanisms and transformers are actively used to boost performance. However, even the best models are limited in quality when data are scarce: generating a variety of descriptions of objects in different situations requires a large training set. The commonly used datasets, although rather large in terms of the number of images, are quite small in terms of the number of different captions per image. We expanded the training dataset using text augmentation methods: augmentation with synonyms as a baseline, and augmentation with the state-of-the-art language model Bidirectional Encoder Representations from Transformers (BERT). As a result, models trained on the augmented datasets show better results than models trained on the dataset without augmentation.
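
As a rough illustration of the two augmentation strategies mentioned above, the sketch below rewrites a caption either by substituting WordNet synonyms (the baseline) or by letting a BERT masked language model propose a contextual replacement for a masked word. This is a minimal sketch assuming NLTK and the Hugging Face Transformers library are available; the function names, replacement probability, and example caption are illustrative and are not taken from the paper's actual implementation.

```python
# Sketch of two caption-augmentation strategies: WordNet synonym replacement
# (baseline) and BERT masked-language-model substitution. Illustrative only.
import random

from nltk.corpus import wordnet          # requires: nltk.download('wordnet')
from transformers import pipeline


def augment_with_synonyms(caption: str, p: float = 0.2) -> str:
    """Baseline: replace some words with a randomly chosen WordNet synonym."""
    out = []
    for w in caption.split():
        synsets = wordnet.synsets(w)
        if synsets and random.random() < p:
            lemmas = {l.replace("_", " ") for s in synsets for l in s.lemma_names()}
            lemmas.discard(w)
            out.append(random.choice(sorted(lemmas)) if lemmas else w)
        else:
            out.append(w)
    return " ".join(out)


# BERT-based augmentation: mask one word and let the masked language model
# propose a contextually plausible replacement for it.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")


def augment_with_bert(caption: str, word_index: int) -> str:
    words = caption.split()
    words[word_index] = fill_mask.tokenizer.mask_token   # "[MASK]" for BERT
    predictions = fill_mask(" ".join(words))
    return predictions[0]["sequence"]                     # top-ranked rewrite


print(augment_with_synonyms("a man riding a horse on the beach"))
print(augment_with_bert("a man riding a horse on the beach", word_index=4))
```

In this sketch the synonym baseline ignores context, while the BERT variant conditions the replacement on the whole caption, which is the contrast the abstract draws between the two methods.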

Highlights

  • Image captioning is the task of automatically generating a textual description of an image [1]. The goal pursued by researchers is to make these textual descriptions as similar as possible to how a human would describe the image

  • We proposed augmenting the image captions in a dataset to improve solutions to the image captioning problem

  • We used MSCOCO [15], the largest and most widely used dataset for image captioning, as the base dataset for performing augmentation, in order to compare the effectiveness of the augmentation methods

Summary

Introduction

Image captioning is the task of automatically generating a textual description of an image [1]. The goal pursued by researchers is to make these textual descriptions as similar as possible to how a human would describe an image. Most of the neural network architectures used are of an encoder-decoder type, for example [4,5,6,7,8]. In such models, an image is first encoded into its hidden representation, and a textual description of the image is then generated (decoded) from this hidden representation. Convolutional neural networks, such as VGG [9] and ResNet [10], are most often used as encoders, because they have proven themselves in a variety of computer vision tasks. Recurrent neural networks, such as RNN [11] or LSTM [12], are used as decoders due to their wide applicability to natural language processing tasks.
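
To make the encoder-decoder setup concrete, the following minimal sketch pairs a pretrained ResNet encoder (via torchvision) with an LSTM decoder in PyTorch. The class names, dimensions, frozen backbone, and teacher-forcing setup are assumptions chosen for illustration; they are not the specific architecture evaluated in the paper.

```python
# Minimal CNN encoder + LSTM decoder captioning sketch (illustrative only).
import torch
import torch.nn as nn
import torchvision.models as models


class CNNEncoder(nn.Module):
    def __init__(self, embed_size: int):
        super().__init__()
        # Pretrained ResNet-50 with its classification head removed
        # (weights API requires torchvision >= 0.13).
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images):                        # (B, 3, H, W)
        with torch.no_grad():                         # keep the CNN frozen
            features = self.backbone(images).flatten(1)
        return self.fc(features)                      # (B, embed_size)


class LSTMDecoder(nn.Module):
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, image_features, captions):      # teacher forcing
        tokens = self.embed(captions)                  # (B, T, embed_size)
        # The encoded image is fed as the first "word" of the sequence.
        inputs = torch.cat([image_features.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.fc(hidden)                         # (B, T+1, vocab_size)


encoder = CNNEncoder(embed_size=256)
decoder = LSTMDecoder(embed_size=256, hidden_size=512, vocab_size=10000)
images = torch.randn(2, 3, 224, 224)                   # dummy image batch
captions = torch.randint(0, 10000, (2, 15))            # dummy token ids
logits = decoder(encoder(images), captions)            # (2, 16, 10000)
```

At inference time the decoder would instead generate the caption token by token, feeding each predicted word back in until an end-of-sentence token is produced.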
