Abstract
Most image captioning methods are trained under full supervision from paired image-caption data. Owing to the expensive cost of collecting such pairs, the task of unpaired image captioning has attracted researchers' attention. In this article, we propose a novel memorial GAN (MemGAN) with joint semantic optimization for unpaired image captioning. The core idea is to explore the implicit semantic correlation between disjoint images and sentences by building a multimodal semantic-aware space (SAS). Concretely, each modality is mapped into a unified multimodal SAS, which contains the semantic vectors of the image I, the visual concepts O, the unpaired sentence S, and the generated caption C. We adopt a memory unit based on multihead attention and a relational gate as the backbone to preserve and transit crucial multimodal semantics in the SAS for image caption generation and sentence reconstruction. The memory unit is then embedded into a GAN framework to exploit the semantic similarity and relevance in the SAS, that is, imposing a joint semantic-aware optimization on the SAS without supervision cues. In summary, the proposed MemGAN learns the latent semantic relevance among the multimodalities of the SAS in an adversarial manner. Extensive experiments and qualitative results demonstrate the effectiveness of MemGAN, which achieves improvements over state-of-the-art methods on unpaired image captioning benchmarks.
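The sketch below illustrates, in PyTorch, the kind of memory unit the abstract describes: multihead attention lets learnable memory slots read multimodal semantic vectors projected into the shared SAS, and a relational gate controls how much of the attended content updates the memory. All names, dimensions, and the exact gating form are assumptions made for illustration, not the authors' precise formulation.

```python
# Minimal sketch of a gated memory unit over a shared semantic-aware space (SAS).
# Hypothetical names and shapes; the paper's exact design may differ.
import torch
import torch.nn as nn


class MemoryUnit(nn.Module):
    def __init__(self, dim: int = 512, num_slots: int = 8, num_heads: int = 8):
        super().__init__()
        # Learnable memory slots living in the shared semantic-aware space.
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        # Multihead attention: memory slots query the incoming multimodal semantics.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Relational gate: decides how much attended content overwrites the memory.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, semantic_vectors: torch.Tensor) -> torch.Tensor:
        """semantic_vectors: (batch, seq, dim) embeddings of image regions,
        visual concepts, or sentence tokens already projected into the SAS."""
        batch = semantic_vectors.size(0)
        mem = self.memory.unsqueeze(0).expand(batch, -1, -1)  # (batch, slots, dim)
        # Memory slots attend over the multimodal semantic vectors.
        attended, _ = self.attn(mem, semantic_vectors, semantic_vectors)
        # Gated update: preserve old memory versus transit new semantics.
        g = self.gate(torch.cat([mem, attended], dim=-1))
        return g * attended + (1.0 - g) * mem


# Example usage: feed image-side semantics; the updated memory would condition
# caption generation (sentence-side semantics would drive reconstruction).
unit = MemoryUnit()
image_semantics = torch.randn(4, 36, 512)   # e.g., 36 region features per image
updated_memory = unit(image_semantics)      # (4, 8, 512)
```

In this reading, the same gated memory is shared between the captioning and reconstruction paths, which is one plausible way to "preserve and transit" semantics across modalities before the adversarial objective is applied.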