Abstract

Image captioning is a promising research topic with applications such as searching for desired content in large collections of video data and describing scenes to visually impaired people. Previous research on image captioning has focused on generating one caption per image. However, to increase usability in applications, it is necessary to generate several different captions with varied expressions for a single image. We propose a method that generates multiple captions using a variational autoencoder, a type of generative model. Because image features play an important role in caption generation, we propose a method to extract a Caption Attention Map (CAM) of the image, and CAMs are projected onto a latent distribution. In addition, we propose evaluation methods for the multiple-caption generation task, which has not yet been actively researched. The proposed model outperforms the base model in terms of diversity while achieving comparable accuracy. Moreover, we verify that the model using CAM generates detailed captions describing diverse content in the image.

Highlights

  • An image captioning system able to describe an input image as a sentence has many potential applications

  • With the proposed method, each sampling from the network generates exactly one caption, and performance varies depending on where in the latent space the latent variable is drawn (see the sketch after this list)

  • A vector that reflects the style of the image's attended regions is extracted from the latent space
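
The following is a minimal sketch, not the paper's implementation, of the one-sample-one-caption behavior described above: each latent draw decodes to a single caption, so generating multiple captions requires multiple draws. The `encoder` and `decoder` modules and the `scale` parameter are hypothetical placeholders introduced for illustration.

```python
import torch

def generate_captions(encoder, decoder, image_feats, num_captions=5, scale=1.0):
    """Hypothetical helper: each latent sample z decodes to one caption.

    `scale` controls where in the latent space z is drawn: scale=0 collapses
    to the distribution mean, while larger values sample farther from it,
    which is one way the sampling location can affect caption quality.
    """
    mu, logvar = encoder(image_feats)   # parameters of q(z|x)
    std = (0.5 * logvar).exp()
    captions = []
    for _ in range(num_captions):
        z = mu + scale * std * torch.randn_like(std)  # one sample -> one caption
        captions.append(decoder(z, image_feats))
    return captions
```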


Summary

Introduction

An image captioning system able to describe an input image as a sentence has many potential applications. Image captioning using deep learning has been actively studied in the field of computer vision. These studies have focused on producing one caption per image. Most popular approaches have used a convolutional neural network (CNN) as an image encoder and a recurrent neural network (RNN) as a module for generating sentences [1,2,3,4]. To generate multiple captions, we adopt a variational autoencoder (VAE), a generative model consisting of two neural network modules, an encoder and a decoder, that learns the probability distributions of data. Let q(z|x) and p(x|z) be the probability distributions of the encoder and the decoder, respectively.
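
As a concrete, generic illustration of these two modules (not the paper's architecture), the following PyTorch sketch defines an encoder that parameterizes q(z|x) as a diagonal Gaussian and a decoder that parameterizes p(x|z); all layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal generic VAE sketch; dimensions are illustrative assumptions."""

    def __init__(self, x_dim=784, h_dim=256, z_dim=32):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)      # mean of q(z|x)
        self.logvar = nn.Linear(h_dim, z_dim)  # log-variance of q(z|x)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim)
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
        # so sampling stays differentiable with respect to mu and logvar.
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    # Negative ELBO: reconstruction term + KL(q(z|x) || N(0, I))
    recon = F.binary_cross_entropy_with_logits(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```

Training minimizes the negative evidence lower bound (ELBO), i.e., the reconstruction error plus the KL divergence between q(z|x) and the standard normal prior.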
