Abstract
A high-quality image description requires not only logical, fluent language but also rich and accurate content. However, because of the semantic gap between vision and language, most existing image captioning approaches, which directly learn a cross-modal mapping from vision to language, struggle to meet both requirements simultaneously. Inspired by the progressive learning mechanism, we follow a generate-then-refine route and propose a novel Text-Guided Generation and Refinement (TGGAR) model that uses a guide text to improve caption quality. The guide text is selected from the training set according to visual content similarity and is then used to explore the salient objects and extend the set of candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. In addition, we divide the decoder into two submodules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate a logical and fluent primary sentence. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module; it iteratively modifies the details of the primary caption, with the help of the guide text, to make the caption rich and accurate. Extensive experiments on the MS COCO captioning dataset demonstrate that our framework, with fewer parameters, remains comparable to transformer-based methods and achieves state-of-the-art performance compared with other relevant approaches.
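To make the guide-text selection step concrete, the following is a minimal sketch of retrieving captions from a training set by visual similarity. It is an illustration under stated assumptions, not the model's actual procedure: the abstract does not specify the similarity measure or the image features, so cosine similarity over precomputed feature vectors is assumed here, and the names `cosine_similarity`, `select_guide_text`, and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_guide_text(
    query_feature: Sequence[float],
    train_features: Sequence[Sequence[float]],
    train_captions: Sequence[str],
    top_k: int = 1,
) -> List[str]:
    """Return the captions of the top_k training images most similar to the query.

    Assumption: each training image has one precomputed global feature vector
    and one caption; the similarity measure (cosine) is chosen for illustration.
    """
    scored = [
        (cosine_similarity(query_feature, feature), caption)
        for feature, caption in zip(train_features, train_captions)
    ]
    # Highest similarity first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored[:top_k]]


# Toy usage with made-up 2-D features and captions.
train_features = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
train_captions = ["a dog on grass", "a plate of food", "a dog next to a plate"]
guide = select_guide_text([0.9, 0.1], train_features, train_captions, top_k=2)
```

In this sketch the retrieved captions would then serve as the guide text, for example as a source of candidate words; how the guide text is actually consumed by the encoder and the Refiner is not detailed in the abstract.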