Abstract

Video paragraph captioning aims to generate multiple descriptive sentences for a video, striving to match human writing in accuracy, logicality, and richness. However, current research focuses on the accuracy and temporal order of events, overlooking emotion as well as other critical logical relations embedded in human language, such as causal and adversative relations. This neglect impairs smooth transitions across generated event descriptions and restricts the vividness of expression, leaving a gap to the standard of human language. To narrow this gap, a framework that integrates logic and emotion representation learning is proposed. Concretely, a large-scale inter-event relation corpus is constructed based on the EMVPC dataset. This corpus, named EMVPC-EvtRel (standing for “EMVPC-Event Relations”), contains six logical relations widely used in human writing, 127 explicit inter-sentence connectives, and over 20,000 pairs of event segments with newly annotated logical relations. A logical semantic representation learning method is developed to recognize the dependencies between visual events, thereby enriching the representation of video content and boosting the logicality of generated paragraphs. Moreover, a fine-grained emotion recognition module is designed to uncover emotion features embedded in videos. Finally, experimental results on the EMVPC dataset demonstrate the superiority of the proposed method over existing state-of-the-art approaches.
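To make the corpus description concrete, the sketch below shows one plausible way an annotated event pair in EMVPC-EvtRel could be represented and loaded. The field names, the example relation labels, and the `load_relation_pairs` helper are hypothetical illustrations; the abstract does not specify the released data schema or file format.

```python
# Minimal sketch of a possible EMVPC-EvtRel record layout (hypothetical, not the
# authors' released schema). Each record pairs two event segments from the same
# video and annotates the logical relation and explicit connective between them.
from dataclasses import dataclass
from typing import List
import json


@dataclass
class EventRelationPair:
    """One annotated pair of event segments from a video."""
    video_id: str     # identifier of the source video in EMVPC
    event_a: str      # caption of the first event segment
    event_b: str      # caption of the second event segment
    relation: str     # one of the six logical relations, e.g. "causal", "adversative"
    connective: str   # explicit inter-sentence connective, e.g. "because", "however"


def load_relation_pairs(path: str) -> List[EventRelationPair]:
    """Read annotated event pairs from a JSON-lines file (assumed format)."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append(EventRelationPair(**record))
    return pairs
```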
