Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

Qingrong Cheng, Keyu Wen, Xiaodong Gu

doi:10.1109/tmm.2022.3217384

Abstract

Text-to-image synthesis is an attractive but challenging task that aims to generate a photo-realistic and semantically consistent image from a given text description. Images synthesized by off-the-shelf models usually contain fewer components than the corresponding real image and text description, which lowers both image quality and textual-visual consistency. To address this issue, we propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN*, which introduces a dual vision-language matching mechanism to strengthen image quality and semantic consistency. The dual mechanism considers textual-visual matching between the generated image and the corresponding text description, and visual-visual consistency constraints between the synthesized image and the real image. Given a text description, VLMGAN* first encodes it into textual features and then feeds them to a generative model based on dual vision-language matching, which synthesizes a photo-realistic image that is semantically consistent with the text. In addition, the popular evaluation metrics for text-to-image synthesis are borrowed from plain image generation and mainly evaluate the realism and diversity of the synthesized images. We therefore introduce a metric named Vision-Language Matching Score (VLMS), which considers both image quality and the semantic consistency between the synthesized image and the description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods. We implement this strategy on two popular baselines, denoted $\mathrm{VLMGAN}_{+\mathrm{AttnGAN}}$ and $\mathrm{VLMGAN}_{+\mathrm{DFGAN}}$. Experimental results on two widely used datasets show that the model achieves significant improvements over other state-of-the-art methods.
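The dual matching idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the contrastive (InfoNCE-style) form of each term, the cosine similarity, the temperature, the weights `w_text` and `w_visual`, and all function names are hypothetical and are not taken from the paper. The sketch only shows how a textual-visual term and a visual-visual term could be combined into one training signal.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def matching_loss(queries, keys, temperature=0.1):
    """Contrastive loss (assumed form): queries[i] should match keys[i]
    better than it matches any other key in the batch."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[i]  # negative log-softmax of the matching pair
    return total / len(queries)

def dual_matching_loss(text_feats, fake_feats, real_feats,
                       w_text=1.0, w_visual=1.0):
    """Hypothetical combination of the two terms named in the abstract."""
    # Textual-visual term: generated image vs. its text description.
    textual_visual = matching_loss(fake_feats, text_feats, temperature=0.1)
    # Visual-visual term: generated image vs. the corresponding real image.
    visual_visual = matching_loss(fake_feats, real_feats, temperature=0.1)
    return w_text * textual_visual + w_visual * visual_visual

# Toy features: two text descriptions, their real images, and two sets of
# generated-image features (one aligned with the inputs, one swapped).
text = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
real = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
fake_aligned = [[0.8, 0.2, 0.0], [0.2, 0.8, 0.0]]
fake_swapped = [fake_aligned[1], fake_aligned[0]]

loss_aligned = dual_matching_loss(text, fake_aligned, real)
loss_swapped = dual_matching_loss(text, fake_swapped, real)
print(loss_aligned < loss_swapped)
```

In this toy setup, generated features that agree with both the text and the real image receive a lower loss than features assigned to the wrong description, which is the behaviour a dual matching objective is meant to encourage. In practice the features would come from learned text and image encoders rather than hand-written vectors.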

Journal: IEEE Transactions on Multimedia	Publication Date: Jan 1, 2023
Citations: 8

Similar Papers

CKD: Cross-Task Knowledge Distillation for Text-to-Image Synthesis
Mingkuan Yuan ... Yuxin Peng
IEEE Transactions on Multimedia | VOL. 22
22 Nov 2019

LFR-GAN: Local Feature Refinement based Generative Adversarial Network for Text-to-Image Generation
Zijun Deng ... Yuxin Peng
ACM Transactions on Multimedia Computing, Communications, and Applications | VOL. 19
12 Jul 2023

Research on Self-Attention Image Description Technology Based on Object Detection
Kun Ma ... Tao Xu
01 Oct 2022

Semantic layout aware generative adversarial network for text-to-image generation
Jieyu Huang ... Wenjun Zhang
23 May 2023
