Abstract

Object-level features from Faster R-CNN and attention mechanisms have been used extensively in image captioning based on encoder-decoder frameworks. However, most existing methods feed the average pooling of object features to the captioning model as the global image representation and recompute the attention weights over object regions for every generated word, without accounting for the visual persistence exhibited by humans. In this paper, we build Visual Persistence modules in the encoder and the decoder, respectively: the visual persistence module in the encoder seeks the core object features to replace the global image representation, while the visual persistence module in the decoder evaluates the correlation between the previous and current attention results and fuses them into the final attended feature used to generate the next word. Experimental results on MSCOCO validate the effectiveness and competitiveness of our Visual Persistence Model (VPNet). Remarkably, VPNet also achieves competitive scores on most metrics of the MSCOCO online test server compared with existing state-of-the-art methods.
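The abstract does not give the exact formulation of the decoder-side module, but the idea of fusing previous and current attention results can be illustrated with a minimal sketch. Everything below is an assumption for illustration only (module name, gating form, and dimensions are hypothetical, not the authors' implementation); it simply shows one plausible way to let part of the previous attended feature persist into the current decoding step.

```python
# Minimal sketch (not the authors' code) of a decoder-side visual persistence fusion:
# combine the previous attended feature with the current one via a learned gate
# that reflects their correlation.
import torch
import torch.nn as nn


class DecoderVisualPersistence(nn.Module):
    """Hypothetical fusion of previous and current attention results."""

    def __init__(self, feat_dim: int):
        super().__init__()
        # Gate estimating how much of the previous attention result to retain.
        self.gate = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.Sigmoid(),
        )

    def forward(self, prev_attended: torch.Tensor, curr_attended: torch.Tensor) -> torch.Tensor:
        # prev_attended, curr_attended: (batch, feat_dim)
        g = self.gate(torch.cat([prev_attended, curr_attended], dim=-1))
        # Convex combination: part of the previous attention result persists.
        return g * prev_attended + (1.0 - g) * curr_attended


if __name__ == "__main__":
    fuse = DecoderVisualPersistence(feat_dim=512)
    prev = torch.randn(4, 512)   # attended feature from the previous decoding step
    curr = torch.randn(4, 512)   # attended feature at the current step
    fused = fuse(prev, curr)     # (4, 512), fed to the language model to predict the next word
    print(fused.shape)
```

The encoder-side module described in the abstract (selecting core object features instead of mean pooling) would play an analogous role on the encoder outputs; its exact selection mechanism is not specified here.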
