Abstract

Cross-domain image captioning, in which a model is trained on a source domain and then generalized to other domains, usually suffers from a large domain shift. Although prior work has attempted to leverage both paired source data and unpaired target data to minimize this shift, the performance remains unsatisfactory. One main reason is the large discrepancy in language expression between the two domains: diverse language styles are adopted to describe an image from different viewpoints, yielding semantically different descriptions of the same image. To tackle this problem, this paper proposes a Style-based Cross-domain Image Captioner (SCIC), which incorporates discriminative style information into the encoder-decoder framework and interprets an image as a specially styled sentence according to external style instructions. Technically, we design a novel "Instruction-based LSTM" (I-LSTM), which adds an instruct gate to collect a style instruction and then produces output in the format that instruction specifies. Two objectives are designed to train I-LSTM: 1) generating correct image descriptions and 2) generating correct styles, so that the model is expected to accurately capture the semantic meaning of an image through the styled caption as well as understand the caption's syntactic structure. We use MS-COCO as the source domain, and Oxford-102, CUB-200, and Flickr30k as the target domains. Experimental results demonstrate that our model consistently outperforms previous methods, and that the style information incorporated via I-LSTM significantly improves performance, with at least a 5% CIDEr improvement on all datasets.
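The abstract names an "instruct gate" that injects a style instruction into the LSTM cell but gives no equations. The following is a minimal sketch of one plausible formulation, assuming standard LSTM gating plus an extra gate driven by a style embedding; all weight names (`Wi`, `Wf`, `Wo`, `Wg`, `Wn`, `Ws`) and the exact gating equations are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ilstm_step(x, s, h_prev, c_prev, W):
    """One step of a hypothetical Instruction-based LSTM cell.

    x: word embedding, s: style-instruction embedding,
    h_prev / c_prev: previous hidden and cell states,
    W: dict of weight matrices (names are illustrative, not from the paper).
    """
    z = np.concatenate([x, h_prev])
    i = sigmoid(W["Wi"] @ z)      # input gate
    f = sigmoid(W["Wf"] @ z)      # forget gate
    o = sigmoid(W["Wo"] @ z)      # output gate
    g = np.tanh(W["Wg"] @ z)      # candidate cell update
    # Assumed "instruct gate": lets the style instruction s modulate
    # what is written into the cell state.
    n = sigmoid(W["Wn"] @ np.concatenate([s, h_prev]))
    c = f * c_prev + i * g + n * np.tanh(W["Ws"] @ s)
    h = o * np.tanh(c)
    return h, c

# Toy usage with random weights (d = hidden size = embedding size).
rng = np.random.default_rng(0)
d = 8
W = {k: rng.standard_normal((d, 2 * d)) * 0.1
     for k in ("Wi", "Wf", "Wo", "Wg", "Wn")}
W["Ws"] = rng.standard_normal((d, d)) * 0.1
h, c = ilstm_step(rng.standard_normal(d), rng.standard_normal(d),
                  np.zeros(d), np.zeros(d), W)
```

In this sketch the instruct gate `n` is computed from the style embedding and the previous hidden state, so the same word sequence can be decoded into different surface styles simply by swapping `s`; that mirrors the abstract's claim that the decoder outputs a specified format according to the instruction.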
