Abstract

A high-quality image description requires not only logical and fluent language but also rich and accurate content. However, owing to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language struggle to meet both requirements simultaneously. Inspired by the progressive learning mechanism, we follow a generate-then-refine route and propose a novel Text-Guided Generation and Refinement (dubbed TGGAR) model that uses a guide text to improve caption quality. The guide text is selected from the training set according to visual content similarity, and is then used to identify salient objects and extend the set of candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. We further divide the decoder into two submodules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module, and iteratively modifies details of the primary caption with the help of the guide text to make the final caption rich and accurate. Extensive experiments on the MS COCO captioning dataset demonstrate that our framework remains comparable to transformer-based methods while using fewer parameters, and achieves state-of-the-art performance compared with other relevant approaches.
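To make the generate-then-refine decoding route concrete, the following PyTorch sketch shows one way the Generator (standard LSTM + GOA) and the Refiner (caption encoder, attention-based LSTM, GOA) could be wired together. Everything here is an illustrative assumption rather than the paper's implementation: the module internals, the gating and fusion scheme, the tensor sizes, the single shared text encoder, and the stand-in for the TGRE visual features are all hypothetical.

```python
# Minimal sketch of a generate-then-refine captioning decoder.
# All module internals and shapes are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class GateOnAttention(nn.Module):
    """Gate on Attention (GOA): gates an attended context vector before word prediction."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden, context):
        # A sigmoid gate decides how much of the attended content to pass through.
        g = torch.sigmoid(self.gate(torch.cat([hidden, context], dim=-1)))
        return g * context


class Generator(nn.Module):
    """Standard LSTM + GOA: decodes the primary caption from visual features."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.goa = GateOnAttention(dim)
        self.out = nn.Linear(dim, vocab_size)

    def step(self, word, visual_ctx, state):
        x = torch.cat([self.embed(word), visual_ctx], dim=-1)
        h, c = self.lstm(x, state)
        return self.out(h + self.goa(h, visual_ctx)), (h, c)


class Refiner(nn.Module):
    """Caption encoder + attention-based LSTM + GOA: rewrites the primary caption
    while attending over both the primary caption and the guide text."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.text_enc = nn.LSTM(dim, dim, batch_first=True)   # one shared text encoder (a simplification)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.goa = GateOnAttention(dim)
        self.out = nn.Linear(dim, vocab_size)

    def step(self, word, primary_caption, guide_text, visual_ctx, state):
        # Encode the primary caption and the guide text, then attend over them jointly.
        mem = torch.cat([self.text_enc(self.embed(primary_caption))[0],
                         self.text_enc(self.embed(guide_text))[0]], dim=1)
        h_prev = state[0] if state is not None else torch.zeros_like(visual_ctx)
        ctx, _ = self.attn(h_prev.unsqueeze(1), mem, mem)
        x = torch.cat([self.embed(word), ctx.squeeze(1)], dim=-1)
        h, c = self.lstm(x, state)
        return self.out(h + self.goa(h, visual_ctx)), (h, c)


# Toy single-step run with hypothetical sizes; a real model decodes full sequences.
vocab, dim, B = 1000, 512, 2
gen, ref = Generator(vocab, dim), Refiner(vocab, dim)
visual_ctx = torch.randn(B, dim)                        # stand-in for the TGRE output
guide = torch.randint(0, vocab, (B, 8))                 # stand-in guide-text token ids
logits, _ = gen.step(torch.zeros(B, dtype=torch.long), visual_ctx, None)
primary = logits.argmax(-1, keepdim=True)               # one token as a stand-in primary caption
refined_logits, _ = ref.step(primary.squeeze(1), primary, guide, visual_ctx, None)
```

The gating step mirrors the abstract's idea of filtering attended content before it reaches the word predictor; the actual TGRE encoder, guide-text retrieval, and iterative refinement schedule described in the paper are omitted here.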
