Abstract

Generative Adversarial Networks (GANs) have significantly advanced text-to-image generation in recent years. To train the generator, recent works widely adopt losses such as the reconstruction loss or the adversarial loss between the generated image and the ground-truth image. These losses are all built on one assumption: that the given text description describes the corresponding image completely. Unfortunately, this assumption does not hold in many cases, especially in datasets with complicated scenes such as COCO, because annotators differ in experience and focus. This paper addresses this issue by proposing a multi-text-to-image training framework that adaptively adjusts the weights of all the text descriptions corresponding to a specific image to produce union description features. With the union description features, the generator can generate more visually consistent images and mitigate the negative optimization caused by incomplete or inconsistent text descriptions. We also reformulate the multi-modal matching loss to better measure the similarity between a generated image and multiple text descriptions. Extensive experiments on the CUB and COCO benchmarks demonstrate the effectiveness and superiority of the proposed method compared to state-of-the-art methods.
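
To make the idea concrete, the sketch below shows one way the K caption embeddings of an image could be adaptively weighted into a single union description feature, and how that feature could be matched against image features with an in-batch contrastive loss. This is a minimal PyTorch sketch under our own assumptions; the module name `UnionTextEncoder`, the learned scoring layer, and the exact loss form are illustrative and not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnionTextEncoder(nn.Module):
    """Aggregate the K sentence embeddings of one image into a union feature."""

    def __init__(self, dim: int):
        super().__init__()
        # Learned scorer that rates how informative each caption is (assumed form).
        self.scorer = nn.Linear(dim, 1)

    def forward(self, sent_embs: torch.Tensor):
        # sent_embs: (batch, K, dim), K captions per image.
        scores = self.scorer(sent_embs).squeeze(-1)          # (batch, K)
        weights = F.softmax(scores, dim=-1)                  # adaptive caption weights
        union = (weights.unsqueeze(-1) * sent_embs).sum(1)   # (batch, dim) union feature
        return union, weights


def multi_text_matching_loss(img_emb: torch.Tensor,
                             union_emb: torch.Tensor,
                             gamma: float = 10.0) -> torch.Tensor:
    """In-batch contrastive matching between image and union text features."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(union_emb, dim=-1)
    logits = gamma * img @ txt.t()                           # (batch, batch) similarity
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric image-to-text and text-to-image cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Example: 4 images, 5 captions each, 256-dim embeddings (dummy data).
encoder = UnionTextEncoder(dim=256)
sent_embs = torch.randn(4, 5, 256)
img_emb = torch.randn(4, 256)
union_emb, weights = encoder(sent_embs)
loss = multi_text_matching_loss(img_emb, union_emb)
```

Because the weights are produced per image, captions that are incomplete or inconsistent with the visual content can be down-weighted rather than contributing equally to the conditioning signal, which is the intuition behind the union description features described above.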
