Abstract

Transformer-based models perform well at capturing the interactions between textual and visual features. However, language bias remains a thorny problem in image captioning, causing inconsistencies between the generated sentences and the actual images. Existing models focus on preventing wrong words from being output and pay little attention to how to correct them; yet until a word has actually been output, the model cannot accurately determine whether it is correct. To address this issue, a Double Decoding Transformer framework is proposed. First, a Rectifier is introduced to correct the output sentences without relying on a pre-trained language module. In addition, visual features provide strong guidance for attention distribution and redistribution in the Decoder and the Rectifier of the proposed framework, respectively. Because of downsampling, information loss during visual feature extraction is inevitable; a Visual Feature Compensation (VFC) module is therefore proposed to compensate for this loss as much as possible. Finally, by integrating these two modules into a transformer-based framework, the Double Decoding Transformer (D2 Transformer) is built. Extensive experiments on the MSCOCO dataset with the "Karpathy" test split demonstrate the effectiveness of the proposed model.
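
As a rough illustration of the double-decoding idea summarized above, the sketch below shows a draft-then-rectify captioning step in PyTorch: a first decoder produces draft states over the caption, and a second decoder ("Rectifier") re-attends to the compensated visual features and the draft to redistribute attention and correct the words. The class names, the gated fusion used for compensation, and all hyper-parameters are illustrative assumptions; the abstract does not specify the actual architecture.

```python
# Minimal sketch of a "draft then rectify" captioning step.
# Not the authors' implementation: module internals and sizes are assumptions.

import torch
import torch.nn as nn


class VisualFeatureCompensation(nn.Module):
    """Hypothetical VFC: gate a finer-grained feature map back into the
    downsampled grid features to recover part of the lost information."""

    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

    def forward(self, coarse, fine):
        # coarse, fine: (batch, num_regions, d_model), already spatially aligned
        g = self.gate(torch.cat([coarse, fine], dim=-1))
        return coarse + g * self.proj(fine)


class D2CaptionStep(nn.Module):
    """First decoder drafts the caption; the Rectifier re-attends to the
    visual features and the draft states to correct the output words."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.rectifier = nn.TransformerDecoder(layer, num_layers)  # second pass
        self.vfc = VisualFeatureCompensation(d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, coarse_feats, fine_feats):
        # tokens: (batch, seq_len); *_feats: (batch, num_regions, d_model)
        vis = self.vfc(coarse_feats, fine_feats)
        tgt = self.embed(tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        draft = self.decoder(tgt, vis, tgt_mask=mask)          # first decoding
        corrected = self.rectifier(draft, vis, tgt_mask=mask)  # rectification
        return self.out(draft), self.out(corrected)


# Toy usage with random inputs
model = D2CaptionStep(vocab_size=10000)
tokens = torch.randint(0, 10000, (2, 12))
coarse, fine = torch.randn(2, 49, 512), torch.randn(2, 49, 512)
draft_logits, corrected_logits = model(tokens, coarse, fine)
print(draft_logits.shape, corrected_logits.shape)  # both (2, 12, 10000)
```

The second pass is the point of the abstract's argument: only after a full draft exists can the model judge whether a word is wrong, so the Rectifier conditions on the draft rather than on ground-truth prefixes alone.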
