Abstract

Multimodal learning has recently received considerable attention in artificial intelligence because of its expected performance gains and potential applications. Text-to-image generation, one such multimodal task, remains a challenging problem in computer vision and natural language processing. GAN-based text-to-image generation models typically rely on a text encoder pre-trained on image-text pairs. However, such encoders cannot extract rich information from texts unseen during pre-training, which makes it difficult to generate images that semantically match a given description. In this paper, we propose a new text-to-image generation model that uses pre-trained BERT, a model widely adopted in natural language processing, as the text encoder. BERT is fine-tuned on a large amount of text so that it captures rich textual information and becomes well suited to the image generation task. Experiments on a multimodal benchmark dataset show that the proposed method improves on the baseline model both quantitatively and qualitatively.
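As a rough illustration of the encoder side (an assumption-heavy sketch, not the authors' released implementation), the snippet below shows how a pre-trained BERT model could turn a caption into a single sentence-level embedding that conditions the image generator; the checkpoint name and mean-pooling strategy are illustrative choices only.

```python
# Hypothetical sketch: extracting a caption embedding with pre-trained BERT
# (illustrative only; the paper's exact pooling/fine-tuning setup may differ).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

caption = "a small bird with a red head and a white belly"
inputs = tokenizer(caption, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    outputs = bert(**inputs)

# Mean-pool the token embeddings into one sentence vector
# (assumed pooling; [CLS] pooling is another common choice).
mask = inputs["attention_mask"].unsqueeze(-1)               # (1, seq_len, 1)
sentence_embedding = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)                             # torch.Size([1, 768])
```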

Highlights

  • Although many deep learning methods have been developed for a single modality, the real world we experience is multimodal, so research on multimodal deep learning is essential for AI to make meaningful progress [1]

  • The model proposed in this paper is denoted as StackGAN+BERT

  • Fine-tuning leaves little space between data points in the text manifold, so features can be extracted even from texts not seen during training (a fine-tuning sketch follows this list)
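The excerpt does not spell out how BERT is fine-tuned on text; the sketch below assumes continued masked-language-model training on the dataset's captions, which is one plausible way to pull the caption vocabulary closer together in the text manifold. The objective, checkpoint, and hyperparameters are assumptions, not the paper's stated procedure.

```python
# Hypothetical sketch: adapting pre-trained BERT to the caption domain with
# masked-language-model (MLM) fine-tuning. The objective and hyperparameters
# are assumptions; the paper only states that BERT is fine-tuned on text.
import torch
from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

captions = [
    "this bird has a yellow crown and black wings",
    "a small bird with a short beak and a grey breast",
]
batch = collator([tokenizer(c) for c in captions])   # pads and masks tokens

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
optimizer.zero_grad()
loss = model(**batch).loss        # MLM loss on randomly masked caption tokens
loss.backward()
optimizer.step()
```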


Summary

Introduction

Although many deep learning methods have been developed for a single modality, the real world we experience is multimodal, so research on multimodal deep learning is essential for AI to make meaningful progress [1]. The goal is to generate high-quality images, close to real ones, from text feature representations. We propose a text-to-image model that combines BERT-based text embedding with high-quality image generation using StackGAN. Existing text-to-image studies leave empty spaces between data points in the text manifold because they rely on a text encoder pre-trained for a zero-shot visual recognition task. We address this problem by fine-tuning pre-trained BERT so that it is suited to the text-to-image generation task. With the fine-tuned BERT as the text encoder, there is little space between data points in the text manifold, so text representations can be extracted effectively, and the resulting embeddings allow more realistic images to be generated than in existing studies. The proposed method shows qualitative and quantitative improvements over existing methods on the CUB multimodal benchmark dataset.
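To make the overall pipeline concrete, the following sketch (with assumed dimensions and module layouts, not the authors' code) passes a BERT caption embedding through StackGAN-style conditioning augmentation and concatenates it with noise to drive a Stage-I generator that outputs a low-resolution 64x64 image.

```python
# Hypothetical sketch of the StackGAN+BERT Stage-I pipeline: a BERT caption
# embedding is compressed by conditioning augmentation (CA) and, together
# with a noise vector, drives a low-resolution generator. Dimensions and
# module layouts are assumptions for illustration.
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Map a 768-d BERT embedding to a smaller Gaussian conditioning vector."""
    def __init__(self, text_dim=768, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)

    def forward(self, text_emb):
        mu, log_sigma = self.fc(text_emb).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        return mu + eps * log_sigma.exp()        # reparameterised sample

class StageIGenerator(nn.Module):
    """Generate a 64x64 image from [noise ; conditioning vector]."""
    def __init__(self, noise_dim=100, cond_dim=128):
        super().__init__()
        self.fc = nn.Linear(noise_dim + cond_dim, 4 * 4 * 512)
        self.upsample = nn.Sequential(
            *[nn.Sequential(nn.Upsample(scale_factor=2),
                            nn.Conv2d(c_in, c_out, 3, padding=1),
                            nn.ReLU())
              for c_in, c_out in [(512, 256), (256, 128), (128, 64), (64, 32)]],
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, noise, cond):
        x = self.fc(torch.cat([noise, cond], dim=1)).view(-1, 512, 4, 4)
        return self.upsample(x)                  # (batch, 3, 64, 64)

bert_embedding = torch.randn(1, 768)             # stand-in for a BERT caption vector
cond = ConditioningAugmentation()(bert_embedding)
fake_image = StageIGenerator()(torch.randn(1, 100), cond)
print(fake_image.shape)                          # torch.Size([1, 3, 64, 64])
```

In the full StackGAN pipeline, a Stage-II generator would then refine this low-resolution output to a higher resolution, conditioned on the same text embedding.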

Related Work
BERT-Based Text Embedding
Generating Low-Resolution Images from Text
Experiments
Datasets
Evaluation Metric
The Compared Models
Quantitative Results
Qualitative Results
Discussion and Conclusions
