Abstract

Transformer-based models perform well at capturing the interactions between textual and visual features. However, language bias remains a thorny problem in image captioning, causing inconsistencies between the generated sentences and the actual images. Existing models focus on preventing wrong words from being output and pay little attention to how to correct them; yet until a word has actually been output, the model cannot accurately determine whether it is correct. To address this issue, a Double Decoding Transformer framework is proposed. First, a Rectifier is introduced to correct the output sentences without relying on a pre-trained language module. In addition, visual features provide strong guidance for attention distribution and redistribution in the Decoder and the Rectifier of the proposed framework, respectively. Because of downsampling, information loss during visual feature extraction is inevitable; a Visual Feature Compensation (VFC) module is therefore proposed to compensate for this loss as much as possible. Finally, by integrating these two modules into a transformer-based framework, the Double Decoding Transformer (D2 Transformer) is built. Extensive experiments on the MSCOCO dataset with the "Karpathy" test split demonstrate the effectiveness of the proposed model.
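
As a rough illustration of the double-decoding idea summarized above, the sketch below shows a draft-then-rectify captioning step in PyTorch: a first decoder produces draft states over the caption, and a second decoder ("Rectifier") re-attends to the compensated visual features and the draft to redistribute attention and correct the words. The class names, the gated fusion used for compensation, and all hyper-parameters are illustrative assumptions; the abstract does not specify the actual architecture.

```python
# Minimal sketch of a "draft then rectify" captioning step.
# Not the authors' implementation: module internals and sizes are assumptions.

import torch
import torch.nn as nn


class VisualFeatureCompensation(nn.Module):
    """Hypothetical VFC: gate a finer-grained feature map back into the
    downsampled grid features to recover part of the lost information."""

    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, coarse, fine):
        # coarse, fine: (batch, num_regions, d_model), already spatially aligned
        g = self.gate(torch.cat([coarse, fine], dim=-1))
        return coarse + g * self.proj(fine)


class D2CaptionStep(nn.Module):
    """First decoder drafts the caption; the Rectifier re-attends to the
    visual features and the draft states to correct the output words."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.rectifier = nn.TransformerDecoder(layer, num_layers)  # second pass
        self.vfc = VisualFeatureCompensation(d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, coarse_feats, fine_feats):
        # tokens: (batch, seq_len); *_feats: (batch, num_regions, d_model)
        vis = self.vfc(coarse_feats, fine_feats)
        tgt = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        draft = self.decoder(tgt, vis, tgt_mask=mask)          # first decoding
        corrected = self.rectifier(draft, vis, tgt_mask=mask)  # rectification
        return self.out(draft), self.out(corrected)


# Toy usage with random inputs
model = D2CaptionStep(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 12))
coarse, fine = torch.randn(2, 49, 512), torch.randn(2, 49, 512)
draft_logits, corrected_logits = model(tokens, coarse, fine)
print(draft_logits.shape, corrected_logits.shape)  # both (2, 12, 10000)
```

The second pass is the point of the abstract's argument: only after a full draft exists can the model judge whether a word is wrong, so the Rectifier conditions on the draft rather than on ground-truth prefixes alone.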
