Abstract

Image captioning is a promising research topic with applications such as searching for desired content in large collections of video data and describing scenes to visually impaired people. Previous research on image captioning has focused on generating one caption per image. However, to increase usability in applications, it is necessary to generate several different captions with varied expressions for a single image. We propose a method that generates multiple captions using a variational autoencoder, a type of generative model. Because image features play an important role in caption generation, we propose a method to extract a Caption Attention Map (CAM) of the image, and CAMs are projected onto a latent distribution. In addition, we propose evaluation methods for the multiple-caption generation task, which has not yet been actively researched. The proposed model outperforms the base model in terms of diversity while achieving comparable accuracy. Moreover, we verify that the model using CAM generates detailed captions describing diverse content in the image.

Highlights

  • An image captioning system able to describe an input image as a sentence has many potential applications

  • With the proposed method, each sampling from the network generates exactly one caption, and performance varies depending on where in the latent space the latent variable is drawn (see the sketch after this list)

  • A vector that reflects the style of the image's attended regions is extracted from the latent space
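
The following is a minimal sketch, not the paper's implementation, of the one-sample-one-caption behavior described above: each latent draw decodes to a single caption, so generating multiple captions requires multiple draws. The `encoder` and `decoder` modules and the `scale` parameter are hypothetical placeholders introduced for illustration.

```python
import torch

def generate_captions(encoder, decoder, image_feats, num_captions=5, scale=1.0):
    """Hypothetical helper: each latent sample z decodes to one caption.

    `scale` controls where in the latent space z is drawn: scale=0 collapses
    to the distribution mean, while larger values sample farther from it,
    which is one way the sampling location can affect caption quality.
    """
    mu, logvar = encoder(image_feats)   # parameters of q(z|x)
    std = (0.5 * logvar).exp()
    captions = []
    for _ in range(num_captions):
        z = mu + scale * std * torch.randn_like(std)  # one sample -> one caption
        captions.append(decoder(z, image_feats))
    return captions
```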


Summary

Introduction

An image captioning system able to describe an input image as a sentence has many potential applications. Image captioning using deep learning has been actively studied in the field of computer vision. These studies have focused on producing one caption per image. Most popular approaches have used a convolutional neural network (CNN) as an image encoder and a recurrent neural network (RNN) as a module for generating sentences [1,2,3,4]. To generate multiple captions, we adopt a variational autoencoder (VAE), a generative model consisting of two neural network modules, an encoder and a decoder, that learns the probability distributions of data. Let q(z|x) and p(x|z) be the probability distributions of the encoder and the decoder, respectively.
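
As a concrete, generic illustration of these two modules (not the paper's architecture), the following PyTorch sketch defines an encoder that parameterizes q(z|x) as a diagonal Gaussian and a decoder that parameterizes p(x|z); all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal generic VAE sketch; dimensions are illustrative assumptions."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
        # so sampling stays differentiable with respect to mu and logvar.
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    # Negative ELBO: reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy_with_logits(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Training minimizes the negative evidence lower bound (ELBO), i.e., the reconstruction error plus the KL divergence between q(z|x) and the standard normal prior.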
