Exploring Global and Local Linguistic Representations for Text-to-Image Synthesis

Ruifan Li, Xiaojie Wang, Guangwei Zhang, Fangxiang Feng, Ning Wang

doi:10.1109/tmm.2020.2972856

Abstract

The task of text-to-image synthesis is to generate photographic images conditioned on given textual descriptions. This challenging task has recently attracted considerable attention from the multimedia community due to its potential applications. Most recent approaches are built on generative adversarial network (GAN) models and synthesize images conditioned on the global linguistic representation. However, the sparsity of the global representation makes GANs difficult to train and leaves the generated images short of fine-grained information. To address this problem, we propose cross-modal global and local linguistic representations-based generative adversarial networks (CGL-GAN), which incorporate the local linguistic representation into the GAN. In our CGL-GAN, we construct a generator to synthesize the target images and a discriminator to judge whether the generated images conform to the text description. In the discriminator, we construct the cross-modal correlation by projecting the image representations at high and low levels onto the global and local linguistic representations, respectively. We design a hinge loss function to train our CGL-GAN model. We evaluate the proposed CGL-GAN on two publicly available datasets, CUB and MS-COCO. Extensive experiments demonstrate that incorporating fine-grained local linguistic information with cross-modal correlation can greatly improve the performance of text-to-image synthesis, even when generating high-resolution images.
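For readers unfamiliar with hinge-based adversarial training, the following is a minimal sketch of the standard GAN hinge loss. It is not the exact objective of CGL-GAN, which the abstract does not specify. The function names `d_hinge_loss` and `g_hinge_loss` are hypothetical, and the discriminator scores are assumed to be plain real numbers, one per image-text pair.

```python
# Illustrative sketch only: the widely used GAN hinge loss.
# The exact CGL-GAN objective is not given in the abstract; the names
# below are hypothetical and chosen for this example.

def d_hinge_loss(real_scores, fake_scores):
    """Discriminator loss: push real scores above +1 and fake scores below -1."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def g_hinge_loss(fake_scores):
    """Generator loss: raise the discriminator's mean score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)


if __name__ == "__main__":
    # Assumed toy scores for two real and two generated image-text pairs.
    real = [2.0, 0.5]
    fake = [-2.0, 0.0]
    print(d_hinge_loss(real, fake))
    print(g_hinge_loss(fake))
```

In this form, samples that the discriminator already scores beyond the margin contribute zero loss, so its updates concentrate on pairs near the decision boundary.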

Full Text