Abstract

Image captioning in natural language has been an emerging trend. However, social images, each associated with a set of user-contributed tags, have rarely been investigated for this task. User-contributed tags, which can reflect user attention, have been neglected in conventional image captioning, and most existing captioning models cannot be applied directly to social images. In this work, a dual attention model is proposed for social image captioning that combines visual attention and user attention simultaneously. Visual attention is used to compress a large amount of salient visual information, while user attention is applied to adjust the description of social images according to their user-contributed tags. Experiments conducted on the Microsoft (MS) COCO dataset demonstrate the superiority of the proposed dual attention method.

Highlights

  • Image caption generation is a hot topic in computer vision and machine learning

  • We propose a novel dual attention model (DAM) for social image captioning based on visual attention and user attention

  • The image is commonly encoded as a convolutional neural network (CNN) feature vector, and the decoder is usually modeled with a recurrent neural network (RNN); see the sketch after this list
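
As a concrete illustration of this encoder-decoder pipeline, here is a minimal PyTorch sketch (PyTorch is an assumption; the paper does not name a framework). A pretrained ResNet-50 stands in for the CNN encoder and a GRU for the RNN decoder; all class, parameter, and dimension names are illustrative, not taken from the paper.

    import torch
    import torch.nn as nn
    from torchvision import models

    class CaptionDecoder(nn.Module):
        """Hypothetical RNN decoder conditioned on a global CNN feature."""
        def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial state
            self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, image_feat, captions):
            # image_feat: (B, feat_dim) global CNN feature; captions: (B, T) token ids
            h0 = torch.tanh(self.init_h(image_feat)).unsqueeze(0)  # (1, B, hidden_dim)
            emb = self.embed(captions)                             # (B, T, embed_dim)
            hidden, _ = self.rnn(emb, h0)                          # (B, T, hidden_dim)
            return self.out(hidden)                                # (B, T, vocab_size) logits

    # Encoder: a pretrained CNN with its classification head removed.
    cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    cnn.fc = nn.Identity()  # expose the 2048-d pooled feature vector

    feats = cnn(torch.randn(4, 3, 224, 224))                       # (4, 2048)
    logits = CaptionDecoder(vocab_size=10000)(feats, torch.zeros(4, 12, dtype=torch.long))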


Summary

Introduction

Image caption generation is a hot topic in computer vision and machine learning, and rapid progress has been made in this area with deep learning recently. Xu et al. [4] proposed "soft visual attention", where only visual features are used to generate image captions (see Figure 2a). We propose a novel dual attention model (DAM) to explore social image captioning based on both visual attention and user attention (see Figure 2c). Social image captioning aims to generate diverse descriptions guided by the corresponding user tags. User attention is proposed to address the differing effects of the generated visual descriptions and the user tags, which leads to personalized social image captions. A dual attention model is proposed for social image captioning that combines visual attention and user attention simultaneously; in this way, the generated descriptions maintain both accuracy and diversity.
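
To make the combination of the two attentions concrete, below is a hedged sketch of one decoding step: soft attention over spatial CNN features (visual attention) and over embeddings of the user-contributed tags (user attention), fused by a sigmoid gate driven by the decoder state. The gated fusion and every name here are assumptions for illustration; the paper's exact DAM equations may differ.

    import torch
    import torch.nn as nn

    class DualAttention(nn.Module):
        def __init__(self, feat_dim=2048, tag_dim=256, hidden_dim=512):
            super().__init__()
            self.vis_score = nn.Linear(feat_dim + hidden_dim, 1)  # visual-attention scorer
            self.tag_score = nn.Linear(tag_dim + hidden_dim, 1)   # user-attention scorer
            self.proj_tag = nn.Linear(tag_dim, feat_dim)          # match the two context dims
            self.gate = nn.Linear(hidden_dim, 1)                  # balances the two contexts

        def attend(self, scorer, feats, h):
            # feats: (B, N, D) candidate features; h: (B, hidden_dim) decoder state
            h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
            alpha = torch.softmax(scorer(torch.cat([feats, h_exp], -1)).squeeze(-1), dim=1)
            return (alpha.unsqueeze(-1) * feats).sum(1)           # weighted context (B, D)

        def forward(self, vis_feats, tag_embs, h):
            # vis_feats: (B, R, feat_dim) spatial CNN features over R regions
            # tag_embs:  (B, K, tag_dim) embeddings of K user-contributed tags
            c_vis = self.attend(self.vis_score, vis_feats, h)     # visual context
            c_tag = self.proj_tag(self.attend(self.tag_score, tag_embs, h))  # tag context
            beta = torch.sigmoid(self.gate(h))                    # (B, 1) mixing gate
            return beta * c_vis + (1 - beta) * c_tag              # fused context vector

In a full model, the fused context vector would be fed, together with the previous word embedding, into the RNN decoder at each time step.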

Related Work
Preliminaries
Dual Attention Model Architecture
Visual Attention
User Attention
Combination of Visual and User Attentions
Datasets and Evaluation Metrics
Overall Comparisons by Using Visual Attributes
Overall Comparison by Using Man-Made User Tags
The Influence of Noise on the Dual Attention Model
Qualitative Analysis
Conclusions

