Abstract

Text-to-image synthesis aims to generate realistic images conditioned on text descriptions. To fuse text information into synthesized images, conditional affine transformations (CATs), such as conditional batch normalization (CBN) and conditional instance normalization (CIN), are usually used to predict batch statistics of different layers. However, ordinary CAT blocks control the batch statistics independently, disregarding the consistency among neighboring layers. To address this issue, we propose a new fusion approach named recurrent affine transformation (RAT) for synthesizing images conditioned on text information. RAT connects all the CAT blocks with recurrent connections to explicitly model the temporal consistency between CAT blocks. To verify the effectiveness of RAT, we propose a novel visualization method that shows how a generative adversarial network (GAN) fuses conditional information. Our microscopic and macroscopic visualizations not only demonstrate the effectiveness of RAT but also turn out to be a useful perspective for analyzing how GANs fuse conditional information. In addition, we propose a more stable spatial attention mechanism for the discriminator, which helps the text description supervise the generator to synthesize more relevant image contents. Extensive experiments on the CUB, Oxford-102, and COCO datasets demonstrate the proposed model's superiority over state-of-the-art models. Our code is available at https://github.com/senmaoy/RAT-GAN.
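To make the core idea concrete, below is a minimal PyTorch sketch of how CAT blocks could be chained with a recurrent connection so that the per-layer affine parameters evolve consistently. This is an illustrative assumption based only on the abstract, not the authors' released code: the class name, the use of an LSTM cell, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class RecurrentAffineTransformation(nn.Module):
    """Hypothetical sketch: an LSTM cell carries a hidden state across
    the generator's CAT blocks, so the per-channel scale/shift (gamma,
    beta) predicted at each block stays consistent with its neighbors."""

    def __init__(self, text_dim, hidden_dim, channels_per_block):
        super().__init__()
        self.rnn = nn.LSTMCell(text_dim, hidden_dim)
        # One affine head per generator block: hidden state -> (gamma, beta).
        self.affine_heads = nn.ModuleList(
            nn.Linear(hidden_dim, 2 * c) for c in channels_per_block
        )

    def forward(self, feats, text_emb):
        """feats: list of feature maps [B, C_i, H_i, W_i]; text_emb: [B, text_dim]."""
        b = text_emb.size(0)
        h = text_emb.new_zeros(b, self.rnn.hidden_size)
        c = text_emb.new_zeros(b, self.rnn.hidden_size)
        out = []
        for x, head in zip(feats, self.affine_heads):
            # Recurrent connection between consecutive CAT blocks.
            h, c = self.rnn(text_emb, (h, c))
            gamma, beta = head(h).chunk(2, dim=1)
            # Channel-wise conditional affine transformation.
            out.append(x * (1 + gamma[..., None, None]) + beta[..., None, None])
        return out
```

In this reading, an ordinary CAT block would predict each (gamma, beta) pair directly from the text embedding in isolation, whereas the recurrent hidden state ties the predictions of neighboring blocks together.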
