Abstract. This paper examines text-to-image generation by comparing two widely used models, Stacked Generative Adversarial Networks (StackGAN) and Attentional Generative Adversarial Networks (AttnGAN), and analyzing their respective strengths and weaknesses. Text-to-image generation has advanced significantly with the introduction of GAN-based models, and this paper explores how these models perform in terms of image quality, realism, and alignment with textual descriptions. Using the Caltech-UCSD Birds-200-2011 (CUB-200-2011) dataset, which consists of bird images, extensive experiments were conducted to evaluate and compare the capabilities of the two models. The results indicate that AttnGAN outperforms StackGAN across multiple metrics, particularly in the accuracy of detail alignment and overall image realism. AttnGAN's multi-level attention mechanism allows it to attend to specific textual elements when generating the corresponding regions of the image, resulting in more visually appealing and semantically consistent outputs. Despite these advancements, challenges remain in improving both the diversity and quality of generated images. This work offers substantial insights into the capabilities and constraints of existing models, providing guidance for future research aimed at improving text-to-image generation.