Abstract

Text-to-image synthesis is the task of generating realistic images that match given text descriptions. Most text-to-image generative networks consist of two modules: a pre-trained text-image encoder and a text-to-image generative adversarial network. In this paper, we propose a stronger text encoder that employs a text Transformer to extract semantically meaningful parts from text descriptions. With the stronger text encoder, the generator obtains more meaningful text information for synthesizing realistic, text-matching images. In addition, we propose a Dynamic Convolutional text-image Fusion Generative Adversarial Network (DCF-GAN), which employs a Dynamic Convolutional Fusion Block to fuse text and image features efficiently. The Dynamic Convolutional Fusion Block adjusts the parameters of its convolution layer according to the given text description so that the generator synthesizes text-matching images, improving the efficiency of fusing text and image features in the generator network. We evaluate the proposed DCF-GAN on two benchmark datasets, CUB and Oxford-102. Extensive experiments demonstrate that the stronger text encoder and the Dynamic Convolutional Fusion Block substantially improve the performance of text-to-image synthesis.
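The abstract does not give the block's exact formulation, only that the convolution parameters are adjusted per text description. The sketch below is one plausible reading in the style of dynamic convolution, not the authors' implementation: a small MLP maps the sentence embedding to mixing weights over K candidate kernels, and each sample is convolved with its own text-weighted kernel. All names, dimensions, and the choice of K are illustrative assumptions.

```python
# Hypothetical sketch of a text-conditioned dynamic convolution fusion block.
# Not the paper's implementation; an assumption based on the abstract's
# description that convolution parameters are adjusted per text description.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConvFusion(nn.Module):
    def __init__(self, in_ch, out_ch, text_dim, num_kernels=4, kernel_size=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, kernel_size
        # K candidate kernels; the text decides how to mix them per sample.
        self.kernels = nn.Parameter(
            torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size) * 0.02
        )
        # Small MLP: sentence embedding -> mixing weights over the K kernels.
        self.attn = nn.Sequential(
            nn.Linear(text_dim, text_dim // 2),
            nn.ReLU(inplace=True),
            nn.Linear(text_dim // 2, num_kernels),
        )

    def forward(self, img_feat, text_emb):
        # img_feat: (B, in_ch, H, W); text_emb: (B, text_dim)
        b = img_feat.size(0)
        alpha = F.softmax(self.attn(text_emb), dim=1)  # (B, K)
        # Per-sample kernel as a text-weighted sum of the candidates.
        w = torch.einsum("bk,koihw->boihw", alpha, self.kernels)
        # Grouped-convolution trick: fold the batch into the channel axis
        # so every sample is convolved with its own generated kernel.
        x = img_feat.reshape(1, b * self.in_ch, *img_feat.shape[2:])
        w = w.reshape(b * self.out_ch, self.in_ch, self.k, self.k)
        out = F.conv2d(x, w, padding=self.k // 2, groups=b)
        return out.reshape(b, self.out_ch, *img_feat.shape[2:])
```

Under this reading, the fusion cost is one convolution plus a tiny MLP per block, which is consistent with the abstract's efficiency claim: the text modulates the kernel itself rather than being concatenated to the feature map at every layer.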
