Abstract

Recently, text-to-image synthesis has made great progress with the advancement of Generative Adversarial Networks (GANs). However, training GAN models requires a large amount of paired image-text data, which is extremely labor-intensive to collect. In this paper, we make the first attempt to train a text-to-image synthesis model in an unsupervised manner, without any human-labeled image-text pairs. Specifically, we first rely on visual concepts to bridge two independent sets of images and sentences, yielding pseudo image-text pairs with which a GAN model can be initialized. A novel visual concept discrimination loss is proposed to train both the generator and the discriminator; it not only encourages the generated image to express the true local visual concepts but also suppresses the noisy visual concepts contained in the pseudo sentence. Afterwards, a global semantic consistency objective with respect to real sentences is used to adapt the pretrained GAN model to real sentences. Experimental results demonstrate that our proposed unsupervised training strategy generates favorable images for given sentences and even outperforms some existing models trained in a supervised manner. The code of this paper is available at https://github.com/dylls/Unsupervised_Text-to-Image_Synthesis.
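
The abstract does not give the loss formulations; the sketch below is only a minimal illustration, under assumed definitions, of how a visual concept discrimination loss with noise suppression and a global semantic consistency term might look. All names (visual_concept_discrimination_loss, global_semantic_consistency, concept_logits, confidence, etc.) are hypothetical and not taken from the authors' released code.

```python
# Minimal, assumption-based sketch (PyTorch). Not the authors' implementation.
import torch
import torch.nn.functional as F

def visual_concept_discrimination_loss(concept_logits, concept_targets, confidence):
    """Multi-label loss over a visual-concept vocabulary.

    concept_logits:  (B, V) discriminator scores, one per concept.
    concept_targets: (B, V) 0/1 pseudo labels built from detected concepts.
    confidence:      (B, V) detector confidences used to down-weight noisy
                     concepts that leak into the pseudo sentence.
    """
    per_concept = F.binary_cross_entropy_with_logits(
        concept_logits, concept_targets, reduction="none")
    # Low-confidence (likely noisy) pseudo concepts contribute less.
    return (confidence * per_concept).mean()

def global_semantic_consistency(image_embed, sentence_embed):
    """Encourage the generated image and the real sentence to agree globally
    in a shared embedding space (one plausible reading of the adaptation step)."""
    return 1.0 - F.cosine_similarity(image_embed, sentence_embed, dim=-1).mean()

# Usage sketch with random tensors standing in for model outputs:
B, V, D = 4, 100, 256
d_loss = visual_concept_discrimination_loss(
    torch.randn(B, V), torch.randint(0, 2, (B, V)).float(), torch.rand(B, V))
g_loss = d_loss + global_semantic_consistency(torch.randn(B, D), torch.randn(B, D))
```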
