Abstract

Text-to-image synthesis aims to generate images that are both realistic and semantically consistent with a natural language description. However, previous methods fuse semantic information insufficiently during image generation. Multi-stage models fuse semantic information of different granularities at different stages, so the semantic information is not fully exploited. Single-stage models fuse only coarse-grained semantic information, ignoring the details carried by fine-grained semantics. To this end, we propose a gradual multi-granularity semantic fusion GAN (GMF-GAN), which gradually fuses multi-granularity semantic information into feature maps of different scales and thereby fully exploits multi-granularity semantics. GMF-GAN adopts two novel attention modules: an Adaptive Sentence Attention Fusion (ASAF) module and an Adaptive Word Attention Fusion (AWAF) module. ASAF guides the generator to focus on sentence-related regional features and adaptively fine-tunes the feature maps via attention weights. AWAF assigns greater weight to word-related regional features by modeling the importance of each word. We also propose a text-image consistency loss to ensure that the generated image is semantically consistent with the input text. Qualitative and quantitative results on two benchmark datasets show that GMF-GAN outperforms state-of-the-art models.
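The abstract does not give the modules' equations, but the described mechanisms suggest a shape for each component. Below is a minimal PyTorch sketch of one plausible reading: ASAF scores each spatial region against the sentence embedding and uses the resulting attention map to adaptively fine-tune the feature map, while AWAF computes region-word affinities so that more important words contribute more to each region. All module structure, layer choices, and the cosine form of the consistency loss are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveSentenceAttentionFusion(nn.Module):
    """Hypothetical ASAF sketch: attend image regions to the sentence
    embedding, then fine-tune the feature map with the attention weights."""

    def __init__(self, feat_dim: int, sent_dim: int):
        super().__init__()
        self.query = nn.Conv2d(feat_dim, feat_dim, kernel_size=1)
        self.key = nn.Linear(sent_dim, feat_dim)
        self.gate = nn.Conv2d(feat_dim, feat_dim, kernel_size=1)

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; sent: (B, D) sentence embedding
        b, c, h, w = feat.shape
        q = self.query(feat).flatten(2)                 # (B, C, H*W)
        k = self.key(sent).unsqueeze(2)                 # (B, C, 1)
        attn = torch.sigmoid((q * k).sum(1) / c ** 0.5)  # (B, H*W) region relevance
        attn = attn.view(b, 1, h, w)
        # Adaptive fine-tuning: emphasize sentence-related regions.
        return feat + attn * self.gate(feat)


class AdaptiveWordAttentionFusion(nn.Module):
    """Hypothetical AWAF sketch: weight regions by per-word importance."""

    def __init__(self, feat_dim: int, word_dim: int):
        super().__init__()
        self.proj = nn.Linear(word_dim, feat_dim)

    def forward(self, feat: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); words: (B, T, D) word embeddings
        b, c, h, w = feat.shape
        regions = feat.flatten(2).transpose(1, 2)       # (B, H*W, C)
        keys = self.proj(words)                         # (B, T, C)
        # Softmax over words models each word's importance to a region.
        attn = F.softmax(regions @ keys.transpose(1, 2) / c ** 0.5, dim=-1)
        context = attn @ keys                           # (B, H*W, C)
        fused = regions + context
        return fused.transpose(1, 2).view(b, c, h, w)


def text_image_consistency_loss(img_emb: torch.Tensor,
                                txt_emb: torch.Tensor) -> torch.Tensor:
    # One plausible form of the text-image consistency loss: penalize
    # low cosine similarity between matched image and text embeddings.
    return 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1).mean()
```

In this reading, ASAF and AWAF would be applied at successive scales of the generator, which matches the "gradual" fusion of multi-granularity semantics the abstract describes.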
