Abstract

Recently, automatic image caption generation has become an important focus of work on the multimodal translation task. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers image information (referred to as visual-level features) directly into a caption, while the latter uses extracted words (referred to as semantic-level attributes) to generate a description. However, previous methods are typically based on a one-stage decoder or only partially utilize the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we propose a novel, well-designed stack decoder model composed of a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics; i.e., the improvements on the BLEU-4 / CIDEr / SPICE scores are 0.372, 1.226 and 0.216, respectively, compared to the state-of-the-art.
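
As a rough illustration of the decoder cell described above, the following PyTorch-style sketch pairs a visual-attention LSTM layer with a semantic-attention LSTM layer. The class names, dimensions, and the exact coupling between the two layers are assumptions made for readability; this is a minimal sketch, not the authors' implementation.

import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over a set of vectors."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_items, feat_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(self.feat_proj(feats)
                                       + self.hidden_proj(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)      # attention weights per item
        return (weights * feats).sum(dim=1)         # attended context vector


class DecoderCell(nn.Module):
    """One decoder cell: a visual-attention LSTM followed by a semantic-attention
    LSTM whose output produces the word distribution for the current step."""
    def __init__(self, feat_dim, attr_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.visual_attn = SoftAttention(feat_dim, hidden_dim)
        self.semantic_attn = SoftAttention(attr_dim, hidden_dim)
        self.lstm_visual = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.lstm_semantic = nn.LSTMCell(hidden_dim + attr_dim, hidden_dim)
        self.word_logits = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, visual_feats, attr_embs, state1, state2):
        # First layer re-weights the visual-level region feature vectors.
        v_ctx = self.visual_attn(visual_feats, state1[0])
        h1, c1 = self.lstm_visual(torch.cat([word_emb, v_ctx], dim=1), state1)
        # Second layer re-weights the semantic-level attribute embeddings,
        # conditioned on the first layer's output (the interactive coupling).
        a_ctx = self.semantic_attn(attr_embs, h1)
        h2, c2 = self.lstm_semantic(torch.cat([h1, a_ctx], dim=1), state2)
        return self.word_logits(h2), (h1, c1), (h2, c2)


# One decoding step with random inputs (batch of 2, 36 regions, 10 attributes).
cell = DecoderCell(feat_dim=2048, attr_dim=300, embed_dim=300,
                   hidden_dim=512, vocab_size=10000)
h = torch.zeros(2, 512)
state1, state2 = (h, h.clone()), (h.clone(), h.clone())
logits, state1, state2 = cell(torch.randn(2, 300), torch.randn(2, 36, 2048),
                              torch.randn(2, 10, 300), state1, state2)

In a multi-stage (coarse-to-fine) setting, several such cells would be stacked so that each stage refines the attention weights and word predictions of the previous one.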

Highlights

  • Image annotation, the process of assigning metadata in the form of captions or keywords to an image, plays a significant role in content-based image retrieval (CBIR) [1], [2]

  • Great attention has been paid to exploiting the encoder-decoder architecture for image caption generation [3]–[8], which commonly consists of a Convolutional Neural Network (CNN)-based image feature encoder and a Recurrent Neural Network (RNN)-based sentence decoder (a minimal sketch of the encoder side follows this list)

  • This paper proposes a unified coarse-to-fine multi-stage architecture that combines bottom-up and top-down approaches based on a visual-semantic attention model, and is capable of effectively leveraging both visual-level image features and semantic-level attributes for image caption generation
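
The CNN-encoder side of such an encoder-decoder captioner can be sketched as follows; it assumes a torchvision ResNet backbone whose spatial feature grid serves as the set of visual-level vectors the decoder attends over. The names and dimensions are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn
import torchvision.models as models


class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # weights=None for brevity; in practice a pretrained backbone would be used
        # (older torchvision versions use pretrained=False/True instead).
        resnet = models.resnet101(weights=None)
        # Drop the average-pooling and classification head, keep the conv trunk.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        # images: (batch, 3, 224, 224) -> feature map (batch, 2048, 7, 7)
        fmap = self.backbone(images)
        batch, channels, h, w = fmap.shape
        # Flatten the spatial grid into a set of region feature vectors.
        return fmap.view(batch, channels, h * w).permute(0, 2, 1)  # (batch, 49, 2048)


encoder = ImageEncoder()
regions = encoder(torch.randn(2, 3, 224, 224))
print(regions.shape)  # torch.Size([2, 49, 2048])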

Summary

Introduction

Image annotation, the process of assigning metadata in the form of captions or keywords to an image, plays a significant role in content-based image retrieval (CBIR) [1], [2]. The challenge of the automatic image caption generation (AICG) task lies in effectively modeling both the visual-level and semantic-level information of a given image to generate a meaningful, human-like, rich image description. Several existing efforts dedicated to this topic can be roughly categorized into two classes, i.e., top-down [3]–[5], [9], [10] and bottom-up [11], [12]. The former converts image information (referred to as visual features) directly into descriptions, while the latter uses extracted words (referred to as semantic-level attributes) to generate a description.
