Abstract

Generative Adversarial Networks (GANs) have significantly advanced text-to-image generation in recent years. To train the generator, recent works widely adopt losses such as the reconstruction loss or the adversarial loss between the generated image and the ground-truth image. These losses are all built on one assumption: that the given text description describes the corresponding image completely. Unfortunately, this assumption does not hold in many cases, especially in datasets with complicated scenes such as COCO, because annotators differ in experience and focus. This paper addresses this issue by proposing a multi-text-to-image training framework that adaptively adjusts the weights of all the text descriptions corresponding to a specific image to produce union description features. With the union description features, the generator can generate more visually consistent images and mitigate the negative optimization caused by incomplete or inconsistent text descriptions. We also reformulate the multi-modal matching loss to better measure the similarity between a generated image and multiple text descriptions. Extensive experiments on the CUB and COCO benchmarks demonstrate the effectiveness and superiority of the proposed method compared to state-of-the-art methods.
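
To make the idea concrete, the sketch below shows one way the K caption embeddings of an image could be adaptively weighted into a single union description feature, and how that feature could be matched against image features with an in-batch contrastive loss. This is a minimal PyTorch sketch under our own assumptions; the module name `UnionTextEncoder`, the learned scoring layer, and the exact loss form are illustrative and not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnionTextEncoder(nn.Module):
    """Aggregate the K sentence embeddings of one image into a union feature."""

    def __init__(self, dim: int):
        super().__init__()
        # Learned scorer that rates how informative each caption is (assumed form).
        self.scorer = nn.Linear(dim, 1)

    def forward(self, sent_embs: torch.Tensor):
        # sent_embs: (batch, K, dim), K captions per image.
        scores = self.scorer(sent_embs).squeeze(-1)          # (batch, K)
        weights = F.softmax(scores, dim=-1)                  # adaptive caption weights
        union = (weights.unsqueeze(-1) * sent_embs).sum(1)   # (batch, dim) union feature
        return union, weights


def multi_text_matching_loss(img_emb: torch.Tensor,
                             union_emb: torch.Tensor,
                             gamma: float = 10.0) -> torch.Tensor:
    """In-batch contrastive matching between image and union text features."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(union_emb, dim=-1)
    logits = gamma * img @ txt.t()                           # (batch, batch) similarity
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric image-to-text and text-to-image cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example: 4 images, 5 captions each, 256-dim embeddings (dummy data).
encoder = UnionTextEncoder(dim=256)
sent_embs = torch.randn(4, 5, 256)
img_emb = torch.randn(4, 256)
union_emb, weights = encoder(sent_embs)
loss = multi_text_matching_loss(img_emb, union_emb)
```

Because the weights are produced per image, captions that are incomplete or inconsistent with the visual content can be down-weighted rather than contributing equally to the conditioning signal, which is the intuition behind the union description features described above.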
