Abstract

Recently, it has been shown that generative adversarial networks (GANs) can be used directly as an extension of traditional reinforcement learning in image captioning. However, existing GAN-based methods generate captions as a function of only local points in the feature map, without capturing non-local information. In this paper, a Multi-Attention mechanism is first proposed that exploits both local and non-local evidence for more effective feature representation and reasoning in image captioning. Based on this mechanism, a Multi-Attention Generative Adversarial Image Captioning Network (MAGAN) is proposed, consisting of a Multi-Attention generator and a Multi-Attention discriminator. The generator is designed to produce more accurate sentences, while the discriminator determines whether a sentence is human-described or machine-generated. Extensive experiments on the MS COCO benchmark dataset validate the proposed framework, which achieves very competitive results on the evaluation server of the MS COCO captioning challenge.
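The abstract does not give the exact formulation, but a minimal sketch of the general idea, fusing per-position (local) attention with a non-local self-attention block over a CNN feature map, might look as follows. All layer names, shapes, and the fusion scheme here are illustrative assumptions, not the authors' implementation.

```python
# Sketch only: combining local and non-local attention over a CNN feature
# map, in the spirit of the Multi-Attention mechanism described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Local branch: an independent weight per spatial position.
        self.local_score = nn.Conv2d(channels, 1, kernel_size=1)
        # Non-local branch (self-attention in the style of non-local
        # neural networks): pairwise responses between all positions.
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, kernel_size=1)
        self.phi = nn.Conv2d(channels, inter, kernel_size=1)
        self.g = nn.Conv2d(channels, inter, kernel_size=1)
        self.out = nn.Conv2d(inter, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Non-local branch: attend over every pair of spatial positions.
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inter)
        k = self.phi(x).flatten(2)                     # (b, inter, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, inter)
        attn = F.softmax(q @ k, dim=-1)                # (b, hw, hw)
        nl = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        nl = self.out(nl)                              # (b, c, h, w)
        # Local branch: reweight each position in isolation.
        local = x * torch.sigmoid(self.local_score(x))
        # Fuse local and non-local evidence with a residual connection
        # (the actual fusion used in MAGAN may differ).
        return x + local + nl

# Usage: attend over a 2048-channel encoder feature map.
feats = torch.randn(2, 2048, 7, 7)
fused = MultiAttention(2048)(feats)
print(fused.shape)  # torch.Size([2, 2048, 7, 7])
```

The point of the non-local branch is that each position's output depends on the whole feature map rather than only on its own local activation, which is the gap in prior GAN-based captioners that the abstract identifies.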
