Abstract

In this paper, we propose an Attentional Concatenation Generative Adversarial Network (ACGAN) aimed at generating 1024 × 1024 high-resolution images. First, we propose a multilevel cascade structure for text-to-image synthesis. During training, we gradually add new layers and, at the same time, feed the results and word vectors from the previous layer into the next layer as inputs, so as to generate high-resolution images with photo-realistic details. Second, the deep attentional multimodal similarity model is introduced into the network: we match word vectors with images in a common semantic space to compute a fine-grained matching loss for training the generator. In this way, the network can attend to fine-grained, word-level semantic information. Finally, a measure of diversity is added to the discriminator, which enables the generator to obtain more diverse gradient directions and improves the diversity of the generated samples. The experimental results show that the inception scores of the proposed model on the CUB and Oxford-102 datasets reach 4.48 and 4.16, improvements of 2.75% and 6.42% over Attentional Generative Adversarial Networks (AttnGAN). The ACGAN model performs better on text-to-image generation, and its generated images are closer to real images.
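As a rough illustration of the attentional concatenation idea described in the abstract, the following PyTorch sketch shows one cascade stage: word-level attention is computed over the previous stage's feature map, the attended word context is concatenated with those features, and the result is refined and upsampled to the next resolution. All class names, dimensions, and the exact attention formulation here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """Attend each spatial location of an image feature map to the word vectors."""
    def __init__(self, feat_dim: int, word_dim: int):
        super().__init__()
        # 1x1 convolution projecting word vectors into the image feature space
        self.proj = nn.Conv1d(word_dim, feat_dim, kernel_size=1)

    def forward(self, feats: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features; words: (B, D, T) for T words
        b, c, h, w = feats.shape
        words = self.proj(words)                          # (B, C, T)
        queries = feats.view(b, c, h * w)                 # (B, C, HW)
        attn = torch.bmm(queries.transpose(1, 2), words)  # (B, HW, T)
        attn = F.softmax(attn, dim=-1)                    # weights over words per location
        context = torch.bmm(words, attn.transpose(1, 2))  # (B, C, HW)
        return context.view(b, c, h, w)

class CascadeStage(nn.Module):
    """One cascade stage: concatenate the attended word context with the
    previous stage's features, refine, and upsample to twice the resolution."""
    def __init__(self, feat_dim: int, word_dim: int):
        super().__init__()
        self.attn = WordAttention(feat_dim, word_dim)
        self.refine = nn.Sequential(
            nn.Conv2d(feat_dim * 2, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(feat_dim),
            nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, feats: torch.Tensor, words: torch.Tensor) -> torch.Tensor:
        context = self.attn(feats, words)
        fused = torch.cat([feats, context], dim=1)  # the "attentional concatenation"
        return self.up(self.refine(fused))

# Stacking such stages doubles the resolution each time, e.g. toward 1024 x 1024.
stage = CascadeStage(feat_dim=64, word_dim=256)
h = torch.randn(2, 64, 32, 32)   # features from the previous stage
e = torch.randn(2, 256, 18)      # embeddings for 18 words of the caption
h_next = stage(h, e)             # (2, 64, 64, 64), input to the next stage
```

Gradually adding such stages during training mirrors the progressive-growing scheme the abstract describes, with each new stage receiving both the previous stage's output and the word vectors.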

Highlights

  • With the rise of artificial intelligence and deep learning, natural language processing and computer vision have become hot research fields. Text-to-image synthesis, as a basic problem in these fields, has attracted the attention and research of many scholars

  • With the continuous development of Generative Adversarial Networks (GANs), they have been widely used to generate realistic, high-quality images from text descriptions. The commonly used method [2,3,4,5] encodes the entire text description into a global sentence vector, which is input to the generator as a condition variable of the GAN to generate an image


Summary

Introduction

With the rise of artificial intelligence and deep learning, natural language processing and computer vision have become hot research fields, and text-to-image synthesis, as a basic problem in these fields, has attracted the attention and research of many scholars. Text-to-image synthesis is the generation of a realistic image that matches a given text description, which requires processing the fuzzy and incomplete information in natural language descriptions. The commonly used method [2,3,4,5] encodes the entire text description into a global sentence vector, which is input to the generator as a condition variable of the GAN to generate an image. However, due to the large structural differences between text and images, the use of only word-level attention does not ensure global semantic consistency and makes it difficult to generate complex scenes, while fine-grained word-level information is still not explicitly used when generating images.
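For contrast with the word-level approach, here is a minimal sketch, assuming a standard conditional-GAN setup, of the "commonly used method" above: the whole caption is compressed into one global sentence vector and concatenated with the noise input. The class name and dimensions are placeholders, not the paper's exact model; the point is that individual word information is lost once the caption is pooled into a single vector.

```python
import torch
import torch.nn as nn

class SentenceConditionedGenerator(nn.Module):
    """Generator conditioned on a single global sentence vector (no word-level detail)."""
    def __init__(self, noise_dim: int = 100, sent_dim: int = 256, feat_dim: int = 64):
        super().__init__()
        self.feat_dim = feat_dim
        # Fuse noise and the sentence vector into an initial 4x4 feature map
        self.fc = nn.Sequential(
            nn.Linear(noise_dim + sent_dim, feat_dim * 4 * 4),
            nn.ReLU(inplace=True),
        )
        # One upsampling step shown; real models stack several to reach full size
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(feat_dim, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, z: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # z: (B, noise_dim) noise; sent: (B, sent_dim) pooled caption embedding
        h = self.fc(torch.cat([z, sent], dim=1)).view(-1, self.feat_dim, 4, 4)
        return self.deconv(h)  # (B, 3, 8, 8) image conditioned only on the sentence

g = SentenceConditionedGenerator()
img = g(torch.randn(4, 100), torch.randn(4, 256))  # 4 images from 4 captions
```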


