Abstract

Multimodal learning has recently received considerable attention in artificial intelligence because of its expected performance gains and potential applications. Text-to-image generation, one such multimodal task, remains a challenging problem in computer vision and natural language processing. GAN-based text-to-image generation models typically rely on a text encoder pre-trained on image-text pairs. However, such encoders cannot extract rich information from texts unseen during pre-training, which makes it difficult to generate images that semantically match a given description. In this paper, we propose a new text-to-image generation model that uses pre-trained BERT, a model widely adopted in natural language processing, as the text encoder. BERT is fine-tuned on a large amount of text so that it captures rich textual information and becomes well suited to the image generation task. Experiments on a multimodal benchmark dataset show that the proposed method improves on the baseline model both quantitatively and qualitatively.
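As a rough illustration of the encoder side (an assumption-heavy sketch, not the authors' released implementation), the snippet below shows how a pre-trained BERT model could turn a caption into a single sentence-level embedding that conditions the image generator; the checkpoint name and mean-pooling strategy are illustrative choices only.

```python
# Hypothetical sketch: extracting a caption embedding with pre-trained BERT
# (illustrative only; the paper's exact pooling/fine-tuning setup may differ).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

caption = "a small bird with a red head and a white belly"
inputs = tokenizer(caption, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = bert(**inputs)

# Mean-pool the token embeddings into one sentence vector
# (assumed pooling; [CLS] pooling is another common choice).
mask = inputs["attention_mask"].unsqueeze(-1)               # (1, seq_len, 1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)                             # torch.Size([1, 768])
```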

Highlights

  • Although many deep learning methods have been developed for a single modality, the real world we experience is multimodal, so research on multimodal deep learning is essential for AI to make meaningful progress [1]

  • The model proposed in this paper is denoted as StackGAN+BERT

  • Fine-tuning leaves little space between data points in the text manifold, so features can be extracted even from texts not seen during training (a fine-tuning sketch follows this list)
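The excerpt does not spell out how BERT is fine-tuned on text; the sketch below assumes continued masked-language-model training on the dataset's captions, which is one plausible way to pull the caption vocabulary closer together in the text manifold. The objective, checkpoint, and hyperparameters are assumptions, not the paper's stated procedure.

```python
# Hypothetical sketch: adapting pre-trained BERT to the caption domain with
# masked-language-model (MLM) fine-tuning. The objective and hyperparameters
# are assumptions; the paper only states that BERT is fine-tuned on text.
import torch
from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

captions = [
    "this bird has a yellow crown and black wings",
    "a small bird with a short beak and a grey breast",
]
batch = collator([tokenizer(c) for c in captions])   # pads and masks tokens

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
optimizer.zero_grad()
loss = model(**batch).loss        # MLM loss on randomly masked caption tokens
loss.backward()
optimizer.step()
```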


Summary

Introduction

Although many deep learning methods have been developed for a single modality, the real world we experience is multimodal, so research on multimodal deep learning is essential for AI to make meaningful progress [1]. The goal is to generate high-quality images, close to real ones, from text feature representations. We propose a text-to-image model that combines BERT-based text embedding with high-quality image generation using StackGAN. Existing text-to-image studies leave empty spaces between data points in the text manifold because they rely on a text encoder pre-trained for a zero-shot visual recognition task. We address this problem by fine-tuning pre-trained BERT so that it is suited to the text-to-image generation task. With the fine-tuned BERT as the text encoder, there is little space between data points in the text manifold, so text representations can be extracted effectively, and the resulting embeddings allow more realistic images to be generated than in existing studies. The proposed method shows qualitative and quantitative improvements over existing methods on the CUB multimodal benchmark dataset.
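To make the overall pipeline concrete, the following sketch (with assumed dimensions and module layouts, not the authors' code) passes a BERT caption embedding through StackGAN-style conditioning augmentation and concatenates it with noise to drive a Stage-I generator that outputs a low-resolution 64x64 image.

```python
# Hypothetical sketch of the StackGAN+BERT Stage-I pipeline: a BERT caption
# embedding is compressed by conditioning augmentation (CA) and, together
# with a noise vector, drives a low-resolution generator. Dimensions and
# module layouts are assumptions for illustration.
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Map a 768-d BERT embedding to a smaller Gaussian conditioning vector."""
    def __init__(self, text_dim=768, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_emb):
        mu, log_sigma = self.fc(text_emb).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        return mu + eps * log_sigma.exp()        # reparameterised sample

class StageIGenerator(nn.Module):
    """Generate a 64x64 image from [noise ; conditioning vector]."""
    def __init__(self, noise_dim=100, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(noise_dim + cond_dim, 4 * 4 * 512)
        self.upsample = nn.Sequential(
            *[nn.Sequential(nn.Upsample(scale_factor=2),
                            nn.Conv2d(c_in, c_out, 3, padding=1),
                            nn.ReLU())
              for c_in, c_out in [(512, 256), (256, 128), (128, 64), (64, 32)]],
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, noise, cond):
        x = self.fc(torch.cat([noise, cond], dim=1)).view(-1, 512, 4, 4)
        return self.upsample(x)                  # (batch, 3, 64, 64)

bert_embedding = torch.randn(1, 768)             # stand-in for a BERT caption vector
cond = ConditioningAugmentation()(bert_embedding)
fake_image = StageIGenerator()(torch.randn(1, 100), cond)
print(fake_image.shape)                          # torch.Size([1, 3, 64, 64])
```

In the full StackGAN pipeline, a Stage-II generator would then refine this low-resolution output to a higher resolution, conditioned on the same text embedding.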

Related Work
BERT-Based Text Embedding
Generating Low-Resolution Images from Text
Experiments
Datasets
Evaluation Metric
The Compared Models
Quantitative Results
Qualitative Results
Discussion and Conclusions
