Abstract

Synthesizing images from descriptive text is an exciting and challenging task in multimodal deep learning, with broad application prospects in visual reasoning, image editing, style transfer, and other fields. This paper proposes SWF-GAN to address two problems: the limited constraint imposed by coarse-grained information makes it difficult to build accurate text-to-image semantic mappings, and ordinary mask predictors lack the representational capacity to accurately perceive the global information of images. SWF-GAN introduces a sentence–word fusion perceptual module that divides the semantic perception of the generative model into two layers, sentence and word: affine transformations built from coarse-grained sentence-level features constrain the overall image synthesis, while fine-grained word-level features guide the synthesis of specific image details. In addition, a weakly supervised coordinate mask predictor is employed at the sentence layer; it extracts long-range dependencies with precise positional information along the vertical and horizontal directions, assigning more attention to the subject within a complex background and thereby generating the structure of the target object accurately. Experiments show that the proposed sentence–word fusion perceptual generative adversarial network generates clearer and more vivid images without a heavy computational burden. Compared with the baseline model, the proposed model improves the IS and FID scores by 0.97% and 22.95%, respectively, and the results on different datasets together with the ablation study demonstrate the effectiveness of our model.
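For intuition, the weakly supervised coordinate mask predictor described above can be sketched along the lines of coordinate attention: image features are pooled along the height and width axes separately, so the resulting attention encodes long-range dependencies in one direction while keeping precise positions in the other, and the refined features are projected to a single-channel mask. The class name, reduction ratio, and layer choices below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a coordinate-attention-style mask predictor (PyTorch).
# All module names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn

class CoordMaskPredictor(nn.Module):
    """Predicts a spatial foreground mask by pooling features along the
    height and width axes separately, capturing long-range context in one
    direction while preserving exact positions in the other."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
        )
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)
        self.to_mask = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Direction-aware pooling: vertical strip (B, C, H, 1) and horizontal strip (B, C, 1, W)
        pooled_h = x.mean(dim=3, keepdim=True)
        pooled_w = x.mean(dim=2, keepdim=True)
        # Share one reduction layer over both strips, then split them again
        y = torch.cat([pooled_h, pooled_w.transpose(2, 3)], dim=2)   # (B, C, H+W, 1)
        y = self.reduce(y)
        y_h, y_w = torch.split(y, [h, w], dim=2)
        attn_h = torch.sigmoid(self.conv_h(y_h))                     # (B, C, H, 1)
        attn_w = torch.sigmoid(self.conv_w(y_w.transpose(2, 3)))     # (B, C, 1, W)
        # Broadcast both direction-aware attention maps over the feature map
        refined = x * attn_h * attn_w
        return torch.sigmoid(self.to_mask(refined))                  # (B, 1, H, W) soft mask
```

In this sketch, a feature map of shape (B, C, H, W) goes in and a (B, 1, H, W) soft mask comes out; such a mask could then weight the subject region when the sentence-level affine transformation conditions the synthesis.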
