Abstract
Recent generative adversarial network (GAN) based methods have shown promising results on the appealing but challenging task of synthesizing images from text descriptions. These approaches can generate images with plausible general shape and color, but they often produce distorted global structures and unnatural local semantic details. This is largely due to the ineffectiveness of convolutional neural networks at capturing high-level semantic information for pixel-level image synthesis. In this paper, we propose a Dual Attentional Generative Adversarial Network (DualAttn-GAN) in which dual attention modules enhance local details and global structures by attending to related features from relevant words and from different visual regions. As one of the dual modules, the textual attention module explores the fine-grained interaction between vision and language. The visual attention module, in turn, models internal visual representations along the channel and spatial axes, which better captures global structures. Meanwhile, an attention embedding module merges the multi-path features. Furthermore, we adopt an inverted residual structure to boost the representational power of the CNN and apply spectral normalization to stabilize GAN training. With extensive experiments on two benchmark datasets, our method significantly improves over state-of-the-art models on the inception score and Fréchet inception distance metrics.
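A minimal PyTorch sketch of the visual attention idea described above: channel attention and spatial attention applied to an image feature map. The layer sizes and the exact attention formulation here are assumptions for illustration, not the authors' released implementation.

```python
# Hedged sketch: channel + spatial attention over a feature map, assuming a
# CBAM-style formulation (pool -> small MLP / conv -> sigmoid gating).
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights feature channels using global pooling statistics."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> per-channel descriptor via global average pooling
        b, c, _, _ = x.shape
        weights = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))))  # (B, C)
        return x * weights.view(b, c, 1, 1)


class SpatialAttention(nn.Module):
    """Highlights informative spatial positions with a single-channel mask."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pool over channels, then predict a spatial mask in [0, 1].
        avg = x.mean(dim=1, keepdim=True)          # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)         # (B, 1, H, W)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)             # toy feature map
    out = SpatialAttention()(ChannelAttention(64)(feats))
    print(out.shape)                               # torch.Size([2, 64, 32, 32])
```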
Highlights
Synthesizing images from text descriptions has been a hot topic spanning natural language processing and computer vision
To address the issues above, we propose the Dual Attentional Generative Adversarial Network (DualAttn-GAN) for synthesizing images from text descriptions
Attention embedding module: to enhance the representation of the image features, it combines the outputs of the two attention modules (see the sketch after this list)
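A hedged sketch of the attention embedding idea referenced in the highlight above: features from the textual-attention path and the visual-attention path are merged into one map before the next generator stage. The concatenation plus 1x1 convolution fusion shown here is an assumption about how such merging could be implemented.

```python
# Hypothetical AttentionEmbedding module: fuses two attention paths of equal
# channel width back into a single feature map.
import torch
import torch.nn as nn


class AttentionEmbedding(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Concatenate the two paths, then project back to the original width.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, text_attn: torch.Tensor, vis_attn: torch.Tensor) -> torch.Tensor:
        # text_attn, vis_attn: (B, C, H, W) outputs of the two attention modules
        return self.fuse(torch.cat([text_attn, vis_attn], dim=1))


if __name__ == "__main__":
    t = torch.randn(2, 64, 32, 32)
    v = torch.randn(2, 64, 32, 32)
    print(AttentionEmbedding(64)(t, v).shape)      # torch.Size([2, 64, 32, 32])
```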
Summary
Synthesizing images from text descriptions has been a hot topic spanning natural language processing and computer vision, with significant impact on applications such as content production and advertisement design. The core challenge of text-to-image synthesis lies in generating visually realistic and semantically sensible pixels that match the text description. Two issues stand out. One is the semantic gap between word-level textual concepts and pixel-level visual information: because the mapping between text space and image space is sparse, a single word may change sub-region details of the generated image. The other is that the text description is incomplete and lacks much of the conditioning information, which limits the network's ability to express visual characteristics