Abstract

Recently, automatic image caption generation has become an important focus of work on the multimodal translation task. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers image information (referred to as visual-level features) directly into a caption, while the latter uses extracted words (referred to as semantic-level attributes) to generate a description. However, previous methods are typically based on a one-stage decoder or only partially utilize the visual-level or semantic-level information for image caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we propose a novel, well-designed stack decoder model composed of a sequence of decoder cells, each of which contains two LSTM layers that work interactively to re-optimize attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics; i.e., the improvements on the BLEU-4 / CIDEr / SPICE scores are 0.372, 1.226 and 0.216, respectively, compared to the state-of-the-art.
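
As a rough illustration of the decoder cell described above, the following PyTorch-style sketch pairs a visual-attention LSTM layer with a semantic-attention LSTM layer. The class names, dimensions, and the exact coupling between the two layers are assumptions made for readability; this is a minimal sketch, not the authors' implementation.

import torch
import torch.nn as nn


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) soft attention over a set of vectors."""
    def __init__(self, feat_dim, hidden_dim, attn_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_items, feat_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(self.feat_proj(feats)
                                       + self.hidden_proj(hidden).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)      # attention weights per item
        return (weights * feats).sum(dim=1)         # attended context vector


class DecoderCell(nn.Module):
    """One decoder cell: a visual-attention LSTM followed by a semantic-attention
    LSTM whose output produces the word distribution for the current step."""
    def __init__(self, feat_dim, attr_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.visual_attn = SoftAttention(feat_dim, hidden_dim)
        self.semantic_attn = SoftAttention(attr_dim, hidden_dim)
        self.lstm_visual = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.lstm_semantic = nn.LSTMCell(hidden_dim + attr_dim, hidden_dim)
        self.word_logits = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, visual_feats, attr_embs, state1, state2):
        # First layer re-weights the visual-level region feature vectors.
        v_ctx = self.visual_attn(visual_feats, state1[0])
        h1, c1 = self.lstm_visual(torch.cat([word_emb, v_ctx], dim=1), state1)
        # Second layer re-weights the semantic-level attribute embeddings,
        # conditioned on the first layer's output (the interactive coupling).
        a_ctx = self.semantic_attn(attr_embs, h1)
        h2, c2 = self.lstm_semantic(torch.cat([h1, a_ctx], dim=1), state2)
        return self.word_logits(h2), (h1, c1), (h2, c2)


# One decoding step with random inputs (batch of 2, 36 regions, 10 attributes).
cell = DecoderCell(feat_dim=2048, attr_dim=300, embed_dim=300,
                   hidden_dim=512, vocab_size=10000)
h = torch.zeros(2, 512)
state1, state2 = (h, h.clone()), (h.clone(), h.clone())
logits, state1, state2 = cell(torch.randn(2, 300), torch.randn(2, 36, 2048),
                              torch.randn(2, 10, 300), state1, state2)

In a multi-stage (coarse-to-fine) setting, several such cells would be stacked so that each stage refines the attention weights and word predictions of the previous one.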

Highlights

  • Image annotation, the process of assigning metadata in the form of captions or keywords to an image, plays a significant role in content-based image retrieval (CBIR) [1], [2]

  • Great attention has been paid to exploiting the encoder-decoder architecture for image caption generation [3]–[8], which commonly consists of a Convolutional Neural Network (CNN)-based image feature encoder and a Recurrent Neural Network (RNN)-based sentence decoder (a minimal sketch of the encoder side follows this list)

  • This paper proposes a unified coarse-to-fine multi-stage architecture that combines bottom-up and top-down approaches based on a visual-semantic attention model, and is capable of effectively leveraging both visual-level image features and semantic-level attributes for image caption generation
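
The CNN-encoder side of such an encoder-decoder captioner can be sketched as follows; it assumes a torchvision ResNet backbone whose spatial feature grid serves as the set of visual-level vectors the decoder attends over. The names and dimensions are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn
import torchvision.models as models


class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # weights=None for brevity; in practice a pretrained backbone would be used
        # (older torchvision versions use pretrained=False/True instead).
        resnet = models.resnet101(weights=None)
        # Drop the average-pooling and classification head, keep the conv trunk.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        # images: (batch, 3, 224, 224) -> feature map (batch, 2048, 7, 7)
        fmap = self.backbone(images)
        batch, channels, h, w = fmap.shape
        # Flatten the spatial grid into a set of region feature vectors.
        return fmap.view(batch, channels, h * w).permute(0, 2, 1)  # (batch, 49, 2048)


encoder = ImageEncoder()
regions = encoder(torch.randn(2, 3, 224, 224))
print(regions.shape)  # torch.Size([2, 49, 2048])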

Summary

Introduction

Image annotation, the process of assigning metadata in the form of captions or keywords to an image, plays a significant role in content-based image retrieval (CBIR) [1], [2]. The challenge of the automatic image caption generation (AICG) task lies in effectively modeling both the visual-level and semantic-level information of a given image to generate a meaningful, human-like, rich image description. Several existing efforts dedicated to this topic can be roughly categorized into two classes, i.e., top-down [3]–[5], [9], [10] and bottom-up [11], [12]. The former converts image information (referred to as visual features) directly into descriptions, while the latter uses extracted words (referred to as semantic-level attributes) to generate a description.
