Abstract
Image captioning is an important task for improving human-computer interaction as well as for a deeper understanding of the mechanisms underlying the image description by human. In recent years, this research field has rapidly developed and a number of impressive results have been achieved. The typical models are based on a neural networks, including convolutional ones for encoding images and recurrent ones for decoding them into text. More than that, attention mechanism and transformers are actively used for boosting performance. However, even the best models have a limit in their quality with a lack of data. In order to generate a variety of descriptions of objects in different situations you need a large training set. The current commonly used datasets although rather large in terms of number of images are quite small in terms of the number of different captions per one image. We expanded the training dataset using text augmentation methods. Methods include augmentation with synonyms as a baseline and the state-of-the-art language model called Bidirectional Encoder Representations from Transformers (BERT). As a result, models that were trained on a datasets augmented show better results than that models trained on a dataset without augmentation.
Highlights
Image captioning is the task of automatically generating a textual description of an image [1].The goal pursued by the researchers is to make these textual descriptions as similar as possible to how a human would describe an image
We proposed the use of augmentation of image captions in a dataset to improve a solution of the image captioning problem
We used MSCOCO [15], which is the largest and most used dataset for image captioning as a base dataset for performing augmentation, in order to compare the effectiveness of the augmentation methods
Summary
Image captioning is the task of automatically generating a textual description of an image [1].The goal pursued by the researchers is to make these textual descriptions as similar as possible to how a human would describe an image. Most of the used neural networks architectures are of an encoder-decoder type for example [4,5,6,7,8] In such models, an image is first encoded to its hidden representation and a textual description of this image is generated (decoded) based on this hidden representation. An image is first encoded to its hidden representation and a textual description of this image is generated (decoded) based on this hidden representation Convolutional neural networks, such as VGG [9] and ResNet [10], are most often used as encoders, because they have proven themselves in a variety of different computer vision tasks. Recurrent neural networks, such as RNN [11] or LSTM [12], are used as decoders due to their wide applicability for natural language processing tasks
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have