Abstract

Recently, text-to-image synthesis has made great progress with the advancement of Generative Adversarial Networks (GANs). However, training GAN models requires a large amount of paired image-text data, which is extremely labor-intensive to collect. In this paper, we make the first attempt to train a text-to-image synthesis model in an unsupervised manner, without any human-labeled image-text pairs. Specifically, we first rely on visual concepts to bridge two independent sets of images and sentences, yielding pseudo image-text pairs with which a GAN model can be initialized. A novel visual concept discrimination loss is proposed to train both the generator and the discriminator; it not only encourages the generated image to express the true local visual concepts but also suppresses the noisy visual concepts contained in the pseudo sentence. Afterwards, a global semantic consistency objective with respect to real sentences is used to adapt the pretrained GAN model to real sentences. Experimental results demonstrate that our proposed unsupervised training strategy generates favorable images for given sentences and even outperforms some existing models trained in a supervised manner. The code of this paper is available at https://github.com/dylls/Unsupervised_Text-to-Image_Synthesis.
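
The abstract does not give the loss formulations; the sketch below is only a minimal illustration, under assumed definitions, of how a visual concept discrimination loss with noise suppression and a global semantic consistency term might look. All names (visual_concept_discrimination_loss, global_semantic_consistency, concept_logits, confidence, etc.) are hypothetical and not taken from the authors' released code.

```python
# Minimal, assumption-based sketch (PyTorch). Not the authors' implementation.
import torch
import torch.nn.functional as F

def visual_concept_discrimination_loss(concept_logits, concept_targets, confidence):
    """Multi-label loss over a visual-concept vocabulary.

    concept_logits:  (B, V) discriminator scores, one per concept.
    concept_targets: (B, V) 0/1 pseudo labels built from detected concepts.
    confidence:      (B, V) detector confidences used to down-weight noisy
                     concepts that leak into the pseudo sentence.
    """
    per_concept = F.binary_cross_entropy_with_logits(
        concept_logits, concept_targets, reduction="none")
    # Low-confidence (likely noisy) pseudo concepts contribute less.
    return (confidence * per_concept).mean()

def global_semantic_consistency(image_embed, sentence_embed):
    """Encourage the generated image and the real sentence to agree globally
    in a shared embedding space (one plausible reading of the adaptation step)."""
    return 1.0 - F.cosine_similarity(image_embed, sentence_embed, dim=-1).mean()

# Usage sketch with random tensors standing in for model outputs:
B, V, D = 4, 100, 256
d_loss = visual_concept_discrimination_loss(
    torch.randn(B, V), torch.randint(0, 2, (B, V)).float(), torch.rand(B, V))
g_loss = d_loss + global_semantic_consistency(torch.randn(B, D), torch.randn(B, D))
```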
