Abstract

Image captioning generates written descriptions of an image. In recent image captioning research, attention regions seldom cover all objects, so generated captions may lack object details and stray from the actual image content. In this paper, we propose a word guided attention (WGA) method for image captioning. First, WGA extracts word information from the embedded word and the memory cell by applying a transformation and multiplication. Then, WGA applies this word information to the attention results and obtains the attended feature vectors via elementwise multiplication. Finally, we apply WGA with words from different time steps to obtain previous word guided attention (PW) and current word guided attention (CW) in the decoder. Experiments on the MSCOCO dataset show that the proposed WGA achieves competitive performance against state-of-the-art methods, with PW reaching a 39.1 Bilingual Evaluation Understudy (BLEU-4) score and a 127.6 Consensus-Based Image Description Evaluation (CIDEr-D) score, and CW reaching a 39.1 BLEU-4 score and a 127.2 CIDEr-D score on the Karpathy test split.

Highlights

  • Image captioning is interdisciplinary research spanning computer vision and natural language processing to generate natural descriptions of images

  • Inspired by attention mechanisms [8] and sequence-to-sequence models [9] exploited in machine translation tasks, an encoder–decoder framework [10,11,12,13,14] has been widely used for image captioning

  • We propose a novel word guided attention (WGA) for image captioning, aimed at extracting more valuable information from images


Summary

Introduction

Image captioning is interdisciplinary research spanning computer vision and natural language processing to generate natural descriptions of images. Inspired by attention mechanisms [8] and sequence-to-sequence models [9] exploited in machine translation tasks, an encoder–decoder framework [10,11,12,13,14] has been widely used for image captioning. In such a framework, images are encoded to feature vectors by a pretrained image classification, object detection, or semantic segmentation model, and decoded to words via an RNN. However, the attention regions in these models seldom cover all objects, so generated captions can miss object details. To address this issue, we propose word guided attention (WGA), which is built from word information and brings specific guidance to the decoder. The information processing method combines memory cell weighting, embedded words, and basic attention. Based on this process, we construct a WGA module in the decoder. With the current-step word, WGA obtains more details and deeper relational information from the current attention region.
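The WGA computation described above can be sketched in a few lines of NumPy. Note that the projection matrices `W_w` and `W_c`, and the specific tanh/sigmoid gating form, are our illustrative assumptions; the paper's actual transformation may differ.

```python
import numpy as np

def word_guided_attention(attended_feat, word_emb, memory_cell, W_w, W_c):
    """Gate an attended image feature with word information.

    word_info is built from the embedded word and the decoder memory
    cell via a transformation and elementwise multiplication; it is
    then applied to the attention result elementwise (hedged sketch).
    """
    word_info = np.tanh(W_w @ word_emb) * np.tanh(W_c @ memory_cell)
    gate = 1.0 / (1.0 + np.exp(-word_info))   # squash to (0, 1)
    return attended_feat * gate               # elementwise modulation

# Toy usage with random vectors standing in for real features.
d = 8
rng = np.random.default_rng(0)
feat = rng.standard_normal(d)   # attended image feature from basic attention
word = rng.standard_normal(d)   # embedded word (previous or current step)
cell = rng.standard_normal(d)   # decoder LSTM memory cell
W_w = rng.standard_normal((d, d))
W_c = rng.standard_normal((d, d))
guided = word_guided_attention(feat, word, cell, W_w, W_c)
```

Using the previous-step word embedding for `word_emb` corresponds to the PW variant, and the current-step word to the CW variant.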

Image Captioning
Attention Mechanism
Methods
Image Captioning Model
Training and Objectives
Dataset
Implementation Details
Quantitative Analysis
Method
Qualitative Analysis
Ablative Studies
Findings
Conclusions
