Abstract

Text-to-image generation (T2I) aims to produce visually compelling images while maintaining a high degree of semantic consistency with textual descriptions. Despite the impressive progress of existing methods, two problems remain: synthesized images lack fine detail, and the correlation between the provided text description and the generated image is often insufficient. To address these issues, we propose a Context-Aware Generative Adversarial Network (CA-GAN), which generates images aligned with the input text representations. Specifically, the Context-Aware Block (CA-Block) learns a semantic-adaptive transformation based on text style, enabling the effective fusion of text descriptions and image features for high-quality image generation with better language-vision matching. Furthermore, we propose an Attention Convolution Module (ACM) that identifies more representative features and captures non-local contextual information, enabling our model to generate images with rich, detailed attributes while maintaining high quality and semantic consistency. In the discriminator, we integrate self-attention with convolution to enhance the feature maps and reinforce semantic information, emphasizing critical feature channels while suppressing extraneous ones, which ultimately yields richer, more detailed images. The experimental results demonstrate the superiority of our method over state-of-the-art (SOTA) approaches. In addition, further studies confirm the effectiveness of the generated visual details, which exhibit a high degree of alignment with the input text descriptions. Notably, our attention mechanism showcases cooperative effects that contribute to overall performance improvement. The code is available at: https://github.com/hylneu/CAGAN.
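To make the fusion idea concrete, the sketch below illustrates one common way a text-conditioned semantic-adaptive transformation can be realized: per-channel scale and shift parameters are predicted from a sentence embedding and applied to intermediate image features. This is only an illustrative assumption drawn from the abstract, not the actual CA-Block implementation (see the linked repository for that); the class name TextAdaptiveFusion and all dimensions are hypothetical.

```python
# Minimal sketch (not the authors' code): a text-conditioned adaptive
# transformation in the spirit of the CA-Block described in the abstract.
# Assumption: image features are modulated by per-channel scale/shift
# parameters predicted from a sentence embedding; names are hypothetical.
import torch
import torch.nn as nn


class TextAdaptiveFusion(nn.Module):
    """Fuses a sentence embedding into image feature maps via a learned
    channel-wise affine transformation (scale and shift)."""

    def __init__(self, text_dim: int, channels: int):
        super().__init__()
        # Two small heads map the sentence embedding to per-channel
        # modulation parameters.
        self.to_scale = nn.Sequential(nn.Linear(text_dim, channels), nn.Tanh())
        self.to_shift = nn.Linear(text_dim, channels)

    def forward(self, feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) image features; sent_emb: (B, text_dim).
        gamma = self.to_scale(sent_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_shift(sent_emb).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return feat * (1 + gamma) + beta


if __name__ == "__main__":
    block = TextAdaptiveFusion(text_dim=256, channels=64)
    feats = torch.randn(4, 64, 16, 16)   # intermediate generator features
    sentence = torch.randn(4, 256)       # pooled text-encoder embedding
    out = block(feats, sentence)
    print(out.shape)  # torch.Size([4, 64, 16, 16])
```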
