Abstract

Text-based image manipulation is a popular subject and has many applications. However, it is a challenging task because there is no ground-truth edited dataset and textual descriptions have abstractive and ambiguous properties. To alleviate the difficult issues, we propose a manipulation framework consisting of the proposal attentional GANs, language-related semantic mask, and language-guided ranker. Specially, we construct an editing proposal generator to generate the suitable edited proposals with and without semantic conditions, which supports the reorganization of sub-generators to output proposals in various aspects as many as possible. To distinguish the text-relevant and the text-irrelevant regions, we introduce a language-related semantic mask based on the source image and target caption. Then, we exploit a language-guided ranker to retrieve the best edited result from the edited proposals through using the multi-modal similarity and the language-related semantic mask. Extensive experiments on widely-used datasets demonstrate that our model could manipulate images interactively and improve the editing quality effectively.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call