Abstract

In this paper, we propose a method called LSAI (learning semantic alignment from image) to recover corrupted image patches for text-guided image inpainting. First, a multimodal preliminary (MP) module is designed to effectively encode global features for images and textual descriptions, where each local image patch and each word are taken into account via multi-head self-attention. Second, non-Euclidean semantic relations between images and textual descriptions are captured by building a semantic relation graph (SRG). By aggregating semantic relations with graph convolution, the constructed SRG highlights the words that meaningfully describe the image content and alleviates the impact of distracting words. In addition, a text-image matching loss is devised to penalize restored images whose visual semantics diverge from the textual description. Quantitative and qualitative experiments on two public datasets demonstrate that our proposed LSAI outperforms prior methods (e.g., reducing FID from 30.87 to 16.73 on the CUB-200-2011 dataset).
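
The abstract gives no implementation details, so the following is only a minimal PyTorch sketch of how an MP-style joint encoder and an SRG-style graph aggregation could look. All names (MPModule, SemanticRelationGraph), dimensions, and the softmax-normalized adjacency are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MPModule(nn.Module):
    """Sketch of a multimodal preliminary (MP) module: image-patch and word
    tokens are concatenated and encoded with multi-head self-attention, so
    every patch attends to every word and vice versa (dims are assumptions)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, word_tokens):
        # patch_tokens: (B, Np, dim), word_tokens: (B, Nw, dim)
        tokens = torch.cat([patch_tokens, word_tokens], dim=1)  # (B, Np+Nw, dim)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)                   # residual + norm
        n_patch = patch_tokens.size(1)
        return tokens[:, :n_patch], tokens[:, n_patch:]         # split back

class SemanticRelationGraph(nn.Module):
    """Sketch of an SRG step: a soft adjacency is built from patch-word
    similarity and word semantics are aggregated onto patches with one
    graph-convolution-style propagation, so words unrelated to the image
    content receive low edge weights."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, patch_tokens, word_tokens):
        # Affinity between every patch and word, row-normalized as edge weights.
        affinity = torch.bmm(patch_tokens, word_tokens.transpose(1, 2))  # (B, Np, Nw)
        weights = F.softmax(affinity / patch_tokens.size(-1) ** 0.5, dim=-1)
        # Aggregate word semantics into each patch (one propagation step).
        aggregated = torch.bmm(weights, self.proj(word_tokens))          # (B, Np, dim)
        return patch_tokens + aggregated
```

A softmax over the word axis is one plausible way to realize "alleviating distracting words": each patch distributes a fixed attention budget, so off-topic words can only receive small edge weights.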
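The text-image matching loss is likewise unspecified in the abstract; one common realization of "penalizing semantic divergence" is a hinge-style contrastive loss over global image and text features, sketched below. The margin value and batch-negative scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def text_image_matching_loss(image_feat, text_feat, margin=0.2):
    """Contrastive matching-loss sketch: pull each restored image's global
    feature toward its own description and push it away from the other
    descriptions in the batch (margin is an illustrative assumption)."""
    image_feat = F.normalize(image_feat, dim=-1)   # (B, dim)
    text_feat = F.normalize(text_feat, dim=-1)     # (B, dim)
    sim = image_feat @ text_feat.t()               # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                  # matched image-text pairs
    # Penalize any mismatched pair scoring within `margin` of the matched one.
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)                         # ignore the positive pairs
    return cost.mean()
```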
