Abstract
Effectively summarizing and re-expressing video content by natural languages in a more human-like fashion is one of the key topics in the field of multimedia content understanding. Despite good progress made in recent years, existing efforts usually overlooked the emotions in user-generated videos, thus making the generated sentence a bit boring and soulless. To fill the research gap, this paper presents a novel emotional video captioning framework in which we design a Vision-based Emotion Interpretation Network to effectively capture the emotions conveyed in videos and describe the visual content in both factual and emotional languages. Specifically, we first model the emotion distribution over an open psychological vocabulary to predict the emotional state of videos. Then, guided by the discovered emotional state, we incorporate visual context, textual context, and visual-textual relevance into an aggregated multimodal contextual vector to enhance video captioning. Furthermore, we optimize the network in a new emotion-fact coordinated way that involves two losses- Emotional Indication Loss and Factual Contrastive Loss, which penalize the error of emotion prediction and visual-textual factual relevance, respectively. In other words, we innovatively introduce emotional representation learning into an end-to-end video captioning network. Extensive experiments on public benchmark datasets, EmVidCap and EmVidCap-S, demonstrate that our method can significantly outperform the state-of-the-art methods by a large margin. Quantitative ablation studies and qualitative analyses clearly show that our method is able to effectively capture the emotions in videos and thus generate emotional language sentences to interpret the video content.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.