Abstract

Text-to-Image (T2I) synthesis is a cross-modality task that takes a text description as input and generates a realistic, semantically consistent image. To guarantee semantic consistency, previous studies regenerate text descriptions from synthetic images and align them with the given descriptions. However, existing redescription modules lack explicit modeling of their training objectives, which is crucial for reliably measuring the semantic distance between redescriptions and the given text inputs. Consequently, the aligned text redescriptions suffer from training bias caused by adversarial image samples, unseen semantics, and mistaken contents in low-quality synthesized images. To this end, we propose a SEMantic distance Adversarial learning (SEMA) framework for Text-to-Image synthesis that strengthens semantic consistency in two ways: 1) We introduce adversarial learning between the image generator and the text redescription module so that they mutually promote or demote the quality of the generated image and text instances. This learning scheme ensures accurate redescription of image contents and thus diminishes the generation of adversarial image samples. 2) We introduce a two-fold semantic distance discrimination (SEM distance) to characterize the semantic relevance between matching text and image pairs. Unseen semantics and mistaken contents are penalized with a large SEM distance. The proposed discrimination method also simplifies model training, since multiple discriminators no longer need to be optimized. Experimental results on the CUB Birds 200 and MS-COCO datasets show that the proposed model outperforms state-of-the-art methods. Code is available at https://github.com/NJUPT-MCC/SEMA.
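To make the two ideas in the abstract concrete, the following is a minimal, hypothetical PyTorch-style sketch, not the paper's exact formulation. It assumes a generator G, a redescription module R that is collapsed with the text encoder into a single mapping from images to text-like embeddings, a text encoder text_enc, and a cosine-based SEM distance; the loss signs and the names G, R, text_enc, sem_distance, and train_step are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F

def sem_distance(a, b):
    """Cosine-based semantic distance in [0, 2]; a larger value means the two
    embeddings disagree more (e.g., unseen semantics or mistaken contents).
    Illustrative stand-in for the paper's SEM distance."""
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def train_step(G, R, text_enc, real_imgs, captions, opt_G, opt_R):
    """One hypothetical adversarial step between the image generator G and the
    redescription module R, both driven by a single SEM-distance critic."""
    txt = text_enc(captions)  # embeddings of the given text descriptions

    # R: redescribe real images accurately (small distance to the text) while
    # pushing redescriptions of the current fakes away from the text (large distance).
    with torch.no_grad():
        fake = G(txt)
    loss_R = sem_distance(R(real_imgs), txt).mean() \
             - sem_distance(R(fake), txt).mean()
    opt_R.zero_grad(); loss_R.backward(); opt_R.step()

    # G: synthesize images whose redescriptions stay semantically close to the input text.
    fake = G(txt)
    loss_G = sem_distance(R(fake), txt).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return float(loss_G), float(loss_R)
```

Under this reading, the single SEM-distance term plays the role of the discriminator, which is how the abstract's claim of avoiding multiple discriminators could be realized.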
