Abstract
Image captioning based on reinforcement learning (RL) has achieved significant success recently. Most of these methods take the CIDEr score as the reward of the RL algorithm to compute gradients, thus refining the baseline image captioning model. However, the CIDEr score is not the sole criterion for judging the quality of a generated caption. In this paper, a Hierarchical Attention Fusion (HAF) model is presented as a baseline for RL-based image captioning, where multi-level feature maps of ResNet are integrated with hierarchical attention. A Revaluation Network (REN) is exploited to revaluate the CIDEr score by assigning a different weight to each word according to its importance in the generated caption; the weighted reward can be regarded as a word-level reward. Moreover, a Scoring Network (SN) is implemented to score the generated sentence against its corresponding ground truth drawn from a batch of captions. This reward benefits from the additional unmatched ground truths and acts as a sentence-level reward. Experimental results on the COCO dataset show that the proposed methods achieve competitive performance compared with related image captioning methods.
Highlights
The goal of image captioning is to automatically generate a natural language description of a given image
The contributions of this paper are as follows: 1) A Hierarchical Attention Fusion (HAF) model is presented as a baseline of reinforcement learning (RL) training for image captioning
The scores produced by the Scoring Network (SN) with CIDEr act as a sentence-level reward in RL training, which encourages a caption generated by the captioning model to best match the given ground truth
Summary
The goal of image captioning is to automatically generate a natural language description of a given image (C. Wu et al.: Hierarchical Attention-Based Fusion for Image Caption With Multi-Grained Rewards). Computing evaluation metrics (e.g., CIDEr, BLEU) as the reward signal is an intuitive way to generate more human-like captions in RL training. A word-level reward is exploited to revise the RL-trained image captioning model, aiming to address the different importance of each generated word (see the left part of Figure 1). The contributions of this paper are as follows: 1) A Hierarchical Attention Fusion (HAF) model is presented as a baseline of RL training for image captioning. 2) A Revaluation Network (REN) is proposed to facilitate the revaluation-reward calculation; it automatically assigns different importance to the generated words in a sentence during the RL training phase. 3) The SN evaluates a generated caption by both its correspondence to the matched ground truth and its discriminativeness from the unmatched ground truths, which enforces generated captions to best match their corresponding ground truths
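The word-level reward described above can be illustrated with a minimal sketch of a self-critical policy-gradient loss in which the sentence-level advantage (sampled-caption CIDEr minus greedy-baseline CIDEr) is redistributed over words by per-word importance weights, as the REN is described as doing. This is a hypothetical, dependency-free illustration, not the paper's implementation: the function name and argument layout are assumptions, and uniform weights recover the standard self-critical objective.

```python
def weighted_scst_loss(log_probs, sample_rewards, baseline_rewards, word_weights):
    """Sketch of a self-critical loss with per-word reward weights.

    log_probs:        list of per-sentence lists of log-probabilities of sampled words
    sample_rewards:   per-sentence CIDEr of the sampled caption
    baseline_rewards: per-sentence CIDEr of the greedy (baseline) caption
    word_weights:     per-word importance weights (e.g., produced by a REN-like module);
                      setting all weights to 1.0 recovers plain SCST
    """
    per_sentence = []
    for lp_row, w_row, sr, br in zip(log_probs, word_weights,
                                     sample_rewards, baseline_rewards):
        advantage = sr - br  # sentence-level advantage, shared by all words
        # each word's gradient contribution is scaled by its importance weight
        per_sentence.append(-sum(w * advantage * lp
                                 for w, lp in zip(w_row, lp_row)))
    return sum(per_sentence) / len(per_sentence)  # mean over the batch
```

With uniform weights this reduces to the usual self-critical baseline subtraction; non-uniform weights let salient words (e.g., objects, attributes) dominate the update.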