Abstract
Image captioning aims to understand and describe the visual content of an image, and it is expected to be applied in automatic news reporting in the future. In recent years, there has been increasing interest in the Encoder-Decoder framework for image captioning: the encoder is responsible for visual semantic comprehension, and the decoder is designed for sentence generation. In the Encoder-Decoder framework, the translation is based on the correspondence between image feature vectors and caption vectors, and an attention mechanism helps establish a more accurate correspondence. However, the attention model works with the decoder, and the focused content changes dynamically with each generated word. As a result, in many cases the salient contents are not described in the caption, or the objects described are not the salient ones. To improve the precision of image captioning, to bridge the gap between image understanding and sentence generation in the Encoder-Decoder framework, and to better align visual information with semantic information, we propose the concept of the visual keyword as a bridge between seeing and saying. This paper presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual Keyword Dataset (IVKD). In addition, a Visual Semantic Attention Model (VSAM) is proposed to obtain the visual keywords used to generate the annotation. In VSAM, object-level visual features are extracted by an object detector pre-trained on IVKD, and the object features are then fed into an Optimized Pointer Network (OPN) to generate visual keywords. Experiments show that the precision of visual keyword generation by the proposed VSAM reaches 91.7%.
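The abstract does not give implementation details of the Optimized Pointer Network, but the following minimal sketch illustrates the general idea of pointing to detected objects as visual keywords: object features from a pre-trained detector are encoded, and at each decoding step an additive-attention "pointer" distribution over the input objects selects the next keyword. The class name `PointerKeywordSelector`, the layer sizes, and the greedy argmax selection are assumptions for illustration, not the authors' exact OPN.

```python
import torch
import torch.nn as nn

class PointerKeywordSelector(nn.Module):
    """Sketch: select a subset of detected objects as visual keywords."""
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTMCell(feat_dim, hidden_dim)
        self.w_enc = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.w_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, obj_feats, num_keywords=5):
        # obj_feats: (batch, num_objects, feat_dim) from a pre-trained detector
        enc_out, (h, c) = self.encoder(obj_feats)
        h, c = h.squeeze(0), c.squeeze(0)
        dec_in = obj_feats.mean(dim=1)              # start symbol: mean object feature
        batch_idx = torch.arange(obj_feats.size(0), device=obj_feats.device)
        pointers = []
        for _ in range(num_keywords):
            h, c = self.decoder(dec_in, (h, c))
            # additive attention over input positions gives a "pointer" distribution
            scores = self.v(torch.tanh(self.w_enc(enc_out) + self.w_dec(h).unsqueeze(1)))
            probs = torch.softmax(scores.squeeze(-1), dim=-1)   # (batch, num_objects)
            idx = probs.argmax(dim=-1)                          # index of selected object
            pointers.append(idx)
            dec_in = obj_feats[batch_idx, idx]                  # feed selected object back in
        return torch.stack(pointers, dim=1)                     # (batch, num_keywords)
```

In this sketch the selected indices would be mapped to object labels to form the visual keywords that condition caption generation; the real OPN may use a different selection and stopping strategy.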
Highlights
Humans can describe an image in words, focusing on the important and interesting things in a view
Image captioning is an interdisciplinary research topic spanning computer vision and natural language processing
This paper presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual Keyword Dataset (IVKD)
Summary
Humans can describe an image in words, focusing on the important and interesting things in view. Compared with traditional image captioning methods, the Encoder-Decoder framework learns all of its knowledge from data, and the generated sentences are much richer. The captions sound smooth and reasonable because the model tends to copy captions from the training data rather than accurately match the visual content. This method aligns all semantic features with a single visual feature vector. The noticed content changes dynamically as the sentence is generated word by word, and this dynamic attention mechanism increases the burden on the decoder to judge which objects are salient. The motivation of this paper is to bridge the gap between the understanding of visual content and sentence generation in the Encoder-Decoder framework. With the same hyperparameters, the Optimized Pointer Network is more sensitive to the sequence length and converges more easily than the unoptimized one.
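The dynamic attention discussed above can be made concrete with a standard soft-attention decoding step: the attention weights are recomputed from the decoder's current hidden state, so the attended image regions shift with every generated word. This is a generic sketch of the mechanism in question, not the paper's model; names such as `region_feats` and the layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttentionCaptionStep(nn.Module):
    """Sketch: one word-generation step with soft attention over image regions."""
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn_feat = nn.Linear(feat_dim, hidden_dim)
        self.attn_hid = nn.Linear(hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.rnn = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, region_feats, state):
        # prev_word: (batch,) previous word ids; region_feats: (batch, num_regions, feat_dim)
        h, c = state
        # attention weights depend on the current hidden state, so the
        # focused regions change with each generated word
        scores = self.attn_score(
            torch.tanh(self.attn_feat(region_feats) + self.attn_hid(h).unsqueeze(1)))
        alpha = torch.softmax(scores.squeeze(-1), dim=-1)           # (batch, num_regions)
        context = (alpha.unsqueeze(-1) * region_feats).sum(dim=1)   # attended visual content
        h, c = self.rnn(torch.cat([self.embed(prev_word), context], dim=-1), (h, c))
        return self.out(h), (h, c), alpha                           # logits, new state, weights
```

Because the decoder alone decides where to look at each step, nothing guarantees that the attended regions correspond to the salient objects, which is the gap the visual keywords are meant to close.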