Abstract

Image captioning is the task of understanding and describing visual content, and it is expected to be applied to automatic news reporting in the future. In recent years, there has been increasing interest in the Encoder-Decoder framework for image captioning: the encoder is responsible for visual semantic comprehension and the decoder is designed for sentence generation. In the Encoder-Decoder framework, translation is based on the correspondence between image feature vectors and caption vectors, and the attention mechanism enables a more accurate correspondence. However, the attention model works with the decoder, and the focused content changes dynamically with each generated word. As a result, in many cases the salient content is not described in the caption, or the objects described are not the salient ones. To improve the precision of image captioning, to bridge the gap between image understanding and sentence generation in the Encoder-Decoder framework, and to align visual information and semantic information better, we propose the concept of the visual keyword as a bridge between seeing and saying. This paper presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual Keyword Dataset (IVKD). In addition, a Visual Semantic Attention Model (VSAM) is proposed to obtain visual keywords for generating the annotation. In VSAM, object-level visual features are extracted by an object detector after pre-training on IVKD. The object features are then fed into an Optimized Pointer Network (OPN) to generate visual keywords. Experiments show that the precision of visual keyword generation by the proposed VSAM reaches 91.7%.
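To make the pipeline concrete, the following is a minimal sketch (not the authors' code) of the idea described above: object-level features from a detector are fed to a pointer-style network that selects which detected objects become visual keywords. The class name, feature dimensions, greedy decoding, and the detector stub are assumptions made purely for illustration; the paper's OPN may differ in structure and training.

```python
# Sketch of a pointer-style keyword selector over detected object features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerKeywordNet(nn.Module):
    """Selects visual keywords by 'pointing' at detected object features (illustrative)."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.decoder = nn.LSTMCell(hidden, hidden)
        # Additive (Bahdanau-style) attention used as the pointer.
        self.W_enc = nn.Linear(hidden, hidden, bias=False)
        self.W_dec = nn.Linear(hidden, hidden, bias=False)
        self.v = nn.Linear(hidden, 1, bias=False)

    def forward(self, obj_feats, n_steps):
        # obj_feats: (batch, n_objects, feat_dim) from an object detector
        enc_out, (h, c) = self.encoder(obj_feats)        # (B, N, H)
        h, c = h.squeeze(0), c.squeeze(0)                # (B, H)
        dec_in = torch.zeros_like(h)                     # start state
        pointers = []
        for _ in range(n_steps):
            h, c = self.decoder(dec_in, (h, c))
            # Pointer scores over the N detected objects.
            scores = self.v(torch.tanh(self.W_enc(enc_out)
                                       + self.W_dec(h).unsqueeze(1))).squeeze(-1)
            attn = F.softmax(scores, dim=-1)             # (B, N)
            idx = attn.argmax(dim=-1)                    # chosen object index
            pointers.append(idx)
            # Feed the encoding of the chosen object to the next step.
            dec_in = enc_out[torch.arange(enc_out.size(0)), idx]
        return torch.stack(pointers, dim=1)              # keyword (object) indices

# Usage: 36 detected objects with 2048-d features each, pick 5 keywords.
feats = torch.randn(2, 36, 2048)
print(PointerKeywordNet()(feats, n_steps=5).shape)       # torch.Size([2, 5])
```

Because the output of a pointer network is an index into its input, each selected index maps directly back to a detected object and hence to its keyword label.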

Highlights

  • Humans can describe an image in words, focusing on the important and interesting things in a view

  • Image captioning is an interdisciplinary research area combining computer vision and natural language processing

  • This paper presents an image dataset derived from MSCOCO as the first collection of visual keywords: the Image Visual Keyword Dataset (IVKD)


Summary

INTRODUCTION

Humans can describe an image in words, focusing on the important and interesting things in a view. Compared with traditional image captioning methods, in the Encoder-Decoder framework all of the knowledge can be learned from data, and the generated sentences are much richer. The captions sound smooth and reasonable because the model tends to copy captions from the training data rather than accurately match the visual content. This approach aligns all semantic features with a single visual feature vector. The attended content changes dynamically as the sentence is generated word by word, and this dynamic attention mechanism increases the burden on the decoder to judge which objects are salient. The motivation of this paper is to bridge the gap between the understanding of visual content and sentence generation in the Encoder-Decoder framework. The optimized network is more sensitive to the length of the sequence and converges more easily than the unoptimized one under the same hyperparameters.
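For reference, the sketch below illustrates the standard soft attention used in Encoder-Decoder captioning that the paragraph above criticizes: the attended context is recomputed from the decoder state after every generated word, so the focus shifts dynamically during generation. This is a generic illustration, not the paper's model; all names and dimensions are assumptions.

```python
# Illustrative soft attention step in a generic Encoder-Decoder captioner.
import torch
import torch.nn.functional as F

def soft_attention(decoder_state, region_feats, W_q, W_k, v):
    # decoder_state: (B, H)    hidden state after emitting the previous word
    # region_feats:  (B, N, D) image region features from the encoder
    scores = v(torch.tanh(W_k(region_feats) + W_q(decoder_state).unsqueeze(1)))
    alpha = F.softmax(scores.squeeze(-1), dim=-1)               # (B, N) weights
    context = (alpha.unsqueeze(-1) * region_feats).sum(dim=1)   # (B, D) context
    return context, alpha

B, N, D, H = 2, 36, 2048, 512
W_q, W_k = torch.nn.Linear(H, H), torch.nn.Linear(D, H)
v = torch.nn.Linear(H, 1, bias=False)
ctx, alpha = soft_attention(torch.randn(B, H), torch.randn(B, N, D), W_q, W_k, v)
print(ctx.shape, alpha.shape)   # torch.Size([2, 2048]) torch.Size([2, 36])
```

Since the weights alpha depend on the decoder state at each step, the regions the model "looks at" are tied to the word being generated, which is what leaves the decoder responsible for deciding which objects are salient.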

RELATED WORK
THE FRAMEWORK OF VSAM
THE PRINCIPLE OF VISUAL KEYWORD GENERATION
EXPERIMENT SETTINGS
EVALUATION METHODS
EXPERIMENT RESULTS AND ANALYSIS
Findings
CONCLUSION

