Abstract
Image captioning in natural language has become an emerging trend. However, social images, which are associated with sets of user-contributed tags, have rarely been investigated for this task. The user-contributed tags, which can reflect user attention, are neglected in conventional image captioning, and most existing captioning models cannot be applied directly to social images. In this work, a dual attention model is proposed for social image captioning that combines visual attention and user attention simultaneously. Visual attention is used to compress a large amount of salient visual information, while user attention adjusts the description of a social image according to its user-contributed tags. Experiments conducted on the Microsoft (MS) COCO dataset demonstrate the superiority of the proposed dual attention method.
Highlights
Image caption generation is a hot topic in computer vision and machine learning
We propose a novel dual attention model (DAM) to explore social image captioning based on visual attention and user attention
The image is commonly encoded as a convolutional neural network (CNN) feature vector, and the decoder is usually modeled with a recurrent neural network (RNN); a minimal sketch of this pipeline follows
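
As a rough illustration of this standard encoder-decoder pipeline, the following is a minimal sketch in PyTorch. All module and variable names here are illustrative assumptions, not taken from the paper: a pooled CNN feature vector initializes an LSTM decoder that emits word logits at each step.

import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Hypothetical CNN-feature-conditioned RNN decoder for captioning."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)   # hidden state -> word logits

    def forward(self, cnn_feat, captions):
        # cnn_feat: (B, feat_dim) pooled CNN image feature; captions: (B, T) word ids
        h0 = self.init_h(cnn_feat).unsqueeze(0)        # (1, B, hidden_dim)
        c0 = self.init_c(cnn_feat).unsqueeze(0)
        emb = self.embed(captions)                     # (B, T, embed_dim)
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)                        # (B, T, vocab_size) logits

At training time the logits would be scored against the ground-truth caption with cross-entropy; at inference the decoder runs step by step, feeding back its own predictions.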
Summary
Image caption generation is a hot topic in computer vision and machine learning, and rapid progress has recently been made in this area with deep learning. "Soft visual attention" was proposed by Xu et al. [4], where only visual features are used to generate image captions (see Figure 2a). We propose a novel dual attention model (DAM) to explore social image captioning based on visual attention and user attention (see Figure 2c). Social image captioning aims to generate diverse descriptions that reflect the corresponding user tags. User attention is introduced to weigh the different contributions of the generated visual descriptions and the user tags, leading to personalized social image captions. The dual attention model combines visual attention and user attention simultaneously, so that the generated descriptions maintain both accuracy and diversity.
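
To make the dual attention concrete, here is a sketch of a single decoding step in PyTorch. The additive attention form, the simple two-context design, and all names are assumptions for illustration; the paper's exact formulation is not reproduced here. The decoder state attends separately over spatial CNN features (visual attention) and over embeddings of the user-contributed tags (user attention), and both context vectors are then passed to the decoder RNN.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionStep(nn.Module):
    """Hypothetical dual attention: visual regions + user-contributed tags."""
    def __init__(self, feat_dim=512, tag_dim=256, hidden_dim=512):
        super().__init__()
        self.vis_score = nn.Linear(feat_dim + hidden_dim, 1)  # visual attention scores
        self.tag_score = nn.Linear(tag_dim + hidden_dim, 1)   # user (tag) attention scores

    def forward(self, regions, tags, h):
        # regions: (B, R, feat_dim) spatial CNN features
        # tags:    (B, K, tag_dim) embeddings of user-contributed tags
        # h:       (B, hidden_dim) current decoder hidden state
        B, R, _ = regions.shape
        K = tags.size(1)
        hv = h.unsqueeze(1).expand(B, R, -1)
        a_v = F.softmax(self.vis_score(torch.cat([regions, hv], dim=-1)).squeeze(-1), dim=1)
        vis_ctx = (a_v.unsqueeze(-1) * regions).sum(dim=1)    # visual context vector

        ht = h.unsqueeze(1).expand(B, K, -1)
        a_t = F.softmax(self.tag_score(torch.cat([tags, ht], dim=-1)).squeeze(-1), dim=1)
        tag_ctx = (a_t.unsqueeze(-1) * tags).sum(dim=1)       # user-attention context

        return vis_ctx, tag_ctx  # fed to the decoder RNN alongside the word embedding

Returning the two context vectors separately leaves the fusion step (for example, concatenation with the word embedding) to the decoder, which mirrors the idea that visual attention and user attention contribute complementary signals to the generated caption.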