Abstract

Image captioning is an important task for understanding images. Recently, many studies have used tags to build alignments between image information and language information. However, existing methods overlook the fact that simple semantic tags struggle to express the detailed semantics of different image contents. Therefore, the authors propose a tag-inferring and tag-guided Transformer for image captioning that generates fine-grained captions. First, a tag-inferring encoder is proposed, which uses the tags extracted by a scene graph model to infer tags with deeper semantic information. Then, with the obtained deep tag information, a tag-guided decoder is proposed that includes short-term attention, which refines the word features in the sentence, and gated cross-modal attention, which combines image features, tag features and language features into informative semantic features. Finally, the word probability distribution at every position in the sequence is computed to generate a description of the image. Experiments demonstrate that the authors' method can exploit tags to obtain precise captions and that it achieves competitive performance, with a 40.6% BLEU-4 score and a 135.3% CIDEr score on the MSCOCO data set.
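To make the gated cross-modal attention idea concrete, the following is a minimal sketch of how a decoder block that fuses image, tag, and language features through a learned gate might look. The abstract gives no implementation details, so the module structure, dimensions, and all names (e.g. GatedCrossModalAttention, d_model, n_heads) are assumptions for illustration, not the authors' actual code.

```python
# Hypothetical sketch (PyTorch): language features attend separately to image
# features and inferred-tag features, and a learned sigmoid gate blends the two
# attended contexts. All names and sizes are assumed, not from the paper.
import torch
import torch.nn as nn

class GatedCrossModalAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.tag_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Gate computed from the word features and both attended contexts.
        self.gate = nn.Sequential(nn.Linear(3 * d_model, d_model), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, words, img_feats, tag_feats):
        # words:     (B, T, d) decoder word/language features
        # img_feats: (B, R, d) region-level image features
        # tag_feats: (B, K, d) embeddings of the inferred tags
        img_ctx, _ = self.img_attn(words, img_feats, img_feats)
        tag_ctx, _ = self.tag_attn(words, tag_feats, tag_feats)
        g = self.gate(torch.cat([words, img_ctx, tag_ctx], dim=-1))
        fused = g * img_ctx + (1.0 - g) * tag_ctx
        # Residual connection and normalization, as in a standard Transformer layer.
        return self.norm(words + fused)
```

In this reading, the gate decides per dimension how much the caption word should rely on visual evidence versus tag semantics; the fused features would then feed the output layer that produces the word probability distribution at each position.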
