Abstract

A high-quality image description requires not only logical and fluent language but also rich and accurate content. However, owing to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language struggle to meet both requirements simultaneously. Inspired by the progressive learning mechanism, we follow a generate-then-refine route and propose a novel Text-Guided Generation and Refinement (dubbed TGGAR) model that uses a guide text to improve caption quality. The guide text is selected from the training set according to visual content similarity, and is then used to identify salient objects and extend the set of candidate words. Specifically, we follow the encoder-decoder architecture and design a Text-Guided Relation Encoder (TGRE) to learn a visual representation that is more consistent with human visual cognition. We further divide the decoder into two submodules: a Generator for primary sentence generation and a Refiner for sentence refinement. The Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. The Refiner contains a caption encoder module, an attention-based LSTM, and a GOA module, and iteratively modifies details of the primary caption with the help of the guide text to make the final caption rich and accurate. Extensive experiments on the MS COCO captioning dataset demonstrate that our framework remains comparable to transformer-based methods while using fewer parameters, and achieves state-of-the-art performance compared with other relevant approaches.
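To make the generate-then-refine decoding route concrete, the following PyTorch sketch shows one way the Generator (standard LSTM + GOA) and the Refiner (caption encoder, attention-based LSTM, GOA) could be wired together. Everything here is an illustrative assumption rather than the paper's implementation: the module internals, the gating and fusion scheme, the tensor sizes, the single shared text encoder, and the stand-in for the TGRE visual features are all hypothetical.

```python
# Minimal sketch of a generate-then-refine captioning decoder.
# All module internals and shapes are assumptions, not the authors' released code.
import torch
import torch.nn as nn


class GateOnAttention(nn.Module):
    """Gate on Attention (GOA): gates an attended context vector before word prediction."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, hidden, context):
        # A sigmoid gate decides how much of the attended content to pass through.
        g = torch.sigmoid(self.gate(torch.cat([hidden, context], dim=-1)))
        return g * context


class Generator(nn.Module):
    """Standard LSTM + GOA: decodes the primary caption from visual features."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.goa = GateOnAttention(dim)
        self.out = nn.Linear(dim, vocab_size)

    def step(self, word, visual_ctx, state):
        x = torch.cat([self.embed(word), visual_ctx], dim=-1)
        h, c = self.lstm(x, state)
        return self.out(h + self.goa(h, visual_ctx)), (h, c)


class Refiner(nn.Module):
    """Caption encoder + attention-based LSTM + GOA: rewrites the primary caption
    while attending over both the primary caption and the guide text."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.text_enc = nn.LSTM(dim, dim, batch_first=True)   # one shared text encoder (a simplification)
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.lstm = nn.LSTMCell(2 * dim, dim)
        self.goa = GateOnAttention(dim)
        self.out = nn.Linear(dim, vocab_size)

    def step(self, word, primary_caption, guide_text, visual_ctx, state):
        # Encode the primary caption and the guide text, then attend over them jointly.
        mem = torch.cat([self.text_enc(self.embed(primary_caption))[0],
                         self.text_enc(self.embed(guide_text))[0]], dim=1)
        h_prev = state[0] if state is not None else torch.zeros_like(visual_ctx)
        ctx, _ = self.attn(h_prev.unsqueeze(1), mem, mem)
        x = torch.cat([self.embed(word), ctx.squeeze(1)], dim=-1)
        h, c = self.lstm(x, state)
        return self.out(h + self.goa(h, visual_ctx)), (h, c)


# Toy single-step run with hypothetical sizes; a real model decodes full sequences.
vocab, dim, B = 1000, 512, 2
gen, ref = Generator(vocab, dim), Refiner(vocab, dim)
visual_ctx = torch.randn(B, dim)                        # stand-in for the TGRE output
guide = torch.randint(0, vocab, (B, 8))                 # stand-in guide-text token ids
logits, _ = gen.step(torch.zeros(B, dtype=torch.long), visual_ctx, None)
primary = logits.argmax(-1, keepdim=True)               # one token as a stand-in primary caption
refined_logits, _ = ref.step(primary.squeeze(1), primary, guide, visual_ctx, None)
```

The gating step mirrors the abstract's idea of filtering attended content before it reaches the word predictor; the actual TGRE encoder, guide-text retrieval, and iterative refinement schedule described in the paper are omitted here.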
