Abstract

The multi-turn text-to-image synthesis task aims to manipulate visual content step by step according to the user's intention, and it has recently attracted considerable research interest in the language and vision community. Compared with traditional text-to-image synthesis, the multi-turn setting is more challenging because 1) it must continuously recognize the user's intention from the spoken instruction and perceive the visual information in the source image; and 2) it requires reasoning about the position, appearance, and characteristics of new modifications in the target image, as well as connecting objects mentioned in the instructions with visual components in the source image. To address these challenges, we propose a Dual Semantic-stream Guidance with global and local linguistics Generative Adversarial Network (DSG-GAN), which reasons about and learns the user's intention from the text description and iteratively manipulates visual information. Specifically, we design a novel dual semantic-stream discriminator that, combined with a hierarchical instruction encoder, evaluates the logic and quality of the match between the human intention expressed in the linguistic instruction and the generated visual content, from the perspective of both global and fine-grained consistency matching. Meanwhile, the gradient backpropagated from the discriminator is used to optimize the instruction encoder, which encourages it to distill the user's intention into global and local information consistent with the visual representation of the manipulation. Extensive experiments show that, even when producing high-resolution images and over many iterative turns, our method performs significantly better, which we attribute to combining local fine-grained linguistic information with cross-modal correlation.
