Abstract

Text-to-image synthesis aims to generate realistic images conditioned on text descriptions. To fuse text information into synthesized images, conditional affine transformations (CATs), such as conditional batch normalization (CBN) and conditional instance normalization (CIN), are usually used to predict batch statistics of different layers. However, ordinary CAT blocks control the batch statistics independently, disregarding the consistency among neighboring layers. To address this issue, we propose a new fusion approach named recurrent affine transformation (RAT) for synthesizing images conditioned on text information. RAT connects all the CAT blocks with recurrent connections to explicitly model the temporal consistency between CAT blocks. To verify the effectiveness of RAT, we propose a novel visualization method that shows how a generative adversarial network (GAN) fuses conditional information. Our microscopic and macroscopic visualizations not only demonstrate the effectiveness of RAT but also turn out to be a useful perspective for analyzing how GANs fuse conditional information. In addition, we propose a more stable spatial attention mechanism for the discriminator, which helps the text description supervise the generator to synthesize more relevant image contents. Extensive experiments on the CUB, Oxford-102, and COCO datasets demonstrate the proposed model's superiority over state-of-the-art models. Our code is available at https://github.com/senmaoy/RAT-GAN.
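To make the core idea concrete, below is a minimal PyTorch sketch of how CAT blocks could be chained with a recurrent connection so that the per-layer affine parameters evolve consistently. This is an illustrative assumption based only on the abstract, not the authors' released code: the class name, the use of an LSTM cell, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class RecurrentAffineTransformation(nn.Module):
    """Hypothetical sketch: an LSTM cell carries a hidden state across
    the generator's CAT blocks, so the per-channel scale/shift (gamma,
    beta) predicted at each block stays consistent with its neighbors."""

    def __init__(self, text_dim, hidden_dim, channels_per_block):
        super().__init__()
        self.rnn = nn.LSTMCell(text_dim, hidden_dim)
        # One affine head per generator block: hidden state -> (gamma, beta).
        self.affine_heads = nn.ModuleList(
            nn.Linear(hidden_dim, 2 * c) for c in channels_per_block
        )

    def forward(self, feats, text_emb):
        """feats: list of feature maps [B, C_i, H_i, W_i]; text_emb: [B, text_dim]."""
        b = text_emb.size(0)
        h = text_emb.new_zeros(b, self.rnn.hidden_size)
        c = text_emb.new_zeros(b, self.rnn.hidden_size)
        out = []
        for x, head in zip(feats, self.affine_heads):
            # Recurrent connection between consecutive CAT blocks.
            h, c = self.rnn(text_emb, (h, c))
            gamma, beta = head(h).chunk(2, dim=1)
            # Channel-wise conditional affine transformation.
            out.append(x * (1 + gamma[..., None, None]) + beta[..., None, None])
        return out
```

In this reading, an ordinary CAT block would predict each (gamma, beta) pair directly from the text embedding in isolation, whereas the recurrent hidden state ties the predictions of neighboring blocks together.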
