Abstract

Recently, text-to-image (T2I) generation has advanced considerably through improvements in synthesis authenticity, text consistency, and generation diversity. However, the large amount of paired image–text data required restricts the generalization of synthesis models to their pre-training language alone. In this paper, a cross-lingual pre-training method is proposed to adapt a target low-resource language to pre-trained generative models. To the best of our knowledge, this is the first time that arbitrary input languages can access T2I generation. The joint encoding scheme fulfills both universal and visual semantic alignment. Given any prepared GAN-based T2I framework, the pre-trained source encoder can be easily fine-tuned to construct a target encoder, thereby fully transferring T2I synthesis ability between languages. A semantic-level alignment, independent of the source T2I structure, is then established to guarantee optimal text consistency and detail generation. Unlike monolingual T2I methods that apply a discriminator to enhance generation quality, we use an adversarial training scheme that optimizes sentence-level alignment together with word-level alignment via a self-attention mechanism. Since low-resource languages often lack parallel texts in practice, the target input embedding is designed to support zero-shot learning. Experimental results demonstrate the robustness of the proposed cross-lingual T2I pre-training across multiple downstream generative models and target languages.
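To make the alignment idea concrete, below is a minimal PyTorch sketch of cross-lingual encoder alignment in the spirit of the abstract, not the paper's actual implementation: the toy encoder, dimensions, loss choices, and weights are all illustrative assumptions. It freezes a pre-trained source-language encoder, fine-tunes a target-language encoder to match it at the sentence level and, via a soft attention map, at the word level, and adds a sentence-level adversarial term in place of an image discriminator.

```python
# Minimal sketch, NOT the paper's implementation: the encoder architecture,
# dimensions, MSE losses, and loss weights below are all assumptions.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256  # assumed embedding size


class ToyTextEncoder(nn.Module):
    """Stand-in text encoder: embeddings plus one self-attention layer."""

    def __init__(self, vocab_size: int, emb_dim: int = EMB_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, num_heads=4, batch_first=True)

    def forward(self, tokens: torch.Tensor):
        x = self.embed(tokens)            # (B, T, D) word-level features
        x, _ = self.attn(x, x, x)         # self-attention over words
        return x.mean(dim=1), x           # sentence embedding, word embeddings


source_enc = ToyTextEncoder(vocab_size=30000)   # pre-trained, kept frozen
target_enc = ToyTextEncoder(vocab_size=20000)   # fine-tuned for the new language
for p in source_enc.parameters():
    p.requires_grad = False

# Sentence-level discriminator for the adversarial alignment term.
disc = nn.Sequential(nn.Linear(EMB_DIM, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

opt_g = torch.optim.Adam(target_enc.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()


def train_step(src_tokens: torch.Tensor, tgt_tokens: torch.Tensor):
    """One alignment step on a batch of (pseudo-)parallel sentence pairs."""
    with torch.no_grad():
        src_sent, src_words = source_enc(src_tokens)
    tgt_sent, tgt_words = target_enc(tgt_tokens)

    # Discriminator: tell source sentence embeddings from target ones.
    ones = torch.ones(src_sent.size(0), 1)
    zeros = torch.zeros(tgt_sent.size(0), 1)
    d_loss = bce(disc(src_sent), ones) + bce(disc(tgt_sent.detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Sentence-level alignment: match the frozen source embedding.
    sent_align = F.mse_loss(tgt_sent, src_sent)

    # Word-level alignment: softly align each target word to source words.
    attn = F.softmax(tgt_words @ src_words.transpose(1, 2) / math.sqrt(EMB_DIM), dim=-1)
    word_align = F.mse_loss(tgt_words, attn @ src_words)

    # Adversarial term: make target embeddings indistinguishable from source.
    adv = bce(disc(tgt_sent), torch.ones(tgt_sent.size(0), 1))

    g_loss = sent_align + word_align + 0.1 * adv   # weights are arbitrary here
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Calling train_step with integer token tensors of shape (batch, seq_len) runs one update. In the zero-shot setting the abstract mentions, such parallel pairs would not be available, so the word-level term would instead have to rely on a shared cross-lingual embedding space.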
