Abstract

Image captioning is the task of generating natural-language descriptions of images. Captions generated by existing image captioning models usually lack semantic discriminability, which is difficult to achieve because it requires the model to capture fine-grained differences between images. In this paper, we propose an image captioning framework with semantic-enhanced features and extremely hard negative examples. These two components are combined in a Semantic-Enhanced Module, which consists of an image-text matching sub-network and a Feature Fusion layer that provides semantic-enhanced features carrying rich semantic information. Moreover, to further improve semantic discriminability, we propose an extremely hard negative mining method that uses extremely hard negative examples to improve the latent alignment between visual and language information. Experimental results on MSCOCO and Flickr30K show that our framework and training method simultaneously improve the performance of image-text matching and image captioning, achieving competitive performance against state-of-the-art methods.
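
To make the "extremely hard negative" idea concrete, the sketch below shows hardest-negative mining inside a bidirectional triplet ranking loss for image-text matching, a common formulation for this kind of sub-network. It is a minimal PyTorch sketch under stated assumptions: the margin value, the cosine-similarity scoring, and the use of in-batch negatives are illustrative choices, not the paper's exact formulation.

```python
# Minimal sketch: triplet ranking loss with hardest in-batch negatives
# for image-text matching. Margin, similarity function, and in-batch
# negative sampling are assumptions for illustration.
import torch
import torch.nn.functional as F

def hardest_negative_triplet_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (batch, dim) L2-normalized embeddings where
    row i of each tensor forms a matched image-text pair."""
    sim = img_emb @ txt_emb.t()   # (batch, batch) cosine similarities
    pos = sim.diag()              # similarity of each matched pair

    # Exclude the matched pairs so they cannot be picked as negatives.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    sim_neg = sim.masked_fill(mask, float('-inf'))

    # Hardest negatives: the most similar mismatched caption per image,
    # and the most similar mismatched image per caption.
    hardest_txt = sim_neg.max(dim=1).values  # per image
    hardest_img = sim_neg.max(dim=0).values  # per caption

    loss_i2t = F.relu(margin + hardest_txt - pos)
    loss_t2i = F.relu(margin + hardest_img - pos)
    return (loss_i2t + loss_t2i).mean()
```

Mining only the hardest negative per anchor, rather than averaging over all negatives, concentrates the gradient on the examples the model currently confuses, which is what drives the improved discriminability the abstract describes.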
