Abstract

Video captioning aims to describe the target objects, their attributes, and their interactions in a video using natural language. Existing methods can generate descriptions that broadly match the video content, but they struggle to capture the finer interactions and actions between objects, which are typically expressed by the predicate of the description sentence. Unlike other parts of the sentence, the predicate depends not only on the dynamic action but also on the static scene, so the relationship between its visual semantics and its textual semantics is difficult to align. In this work, we propose an Augmented Semantic Alignment (ASA) model for video captioning that explicitly learns finer actions by strengthening the semantic alignment between video and text. Specifically, we first introduce a multimodal feature aggregation network to capture high-quality, action-relevant video semantic features. We then employ an action-guided decoder to fuse these video semantic features with predicate information representing actions, yielding finer action descriptions. Evaluated on two public datasets, the proposed model generates descriptions of finer actions that align better semantically with the dynamic content of the videos.
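The abstract describes a two-stage design: a multimodal feature aggregation network followed by an action-guided decoder that fuses video semantics with predicate information. The sketch below is a minimal, hypothetical illustration of that structure, not the authors' implementation; all module names, feature dimensions, and the specific fusion strategy (transformer encoder/decoder layers, additive predicate injection) are assumptions for illustration only.

```python
# Hypothetical sketch of the two-stage design described in the abstract:
# (1) a multimodal feature aggregation module, (2) an action-guided decoder
# that fuses video features with predicate (action) embeddings.
# Dimensions, layer choices, and fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalFeatureAggregator(nn.Module):
    """Aggregates appearance and motion features into action-relevant video semantics."""

    def __init__(self, appearance_dim=2048, motion_dim=1024, hidden_dim=512):
        super().__init__()
        self.appearance_proj = nn.Linear(appearance_dim, hidden_dim)
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, appearance_feats, motion_feats):
        # appearance_feats: (B, T, appearance_dim), motion_feats: (B, T, motion_dim)
        fused = self.appearance_proj(appearance_feats) + self.motion_proj(motion_feats)
        return self.fuse(fused)  # (B, T, hidden_dim) video semantic features


class ActionGuidedDecoder(nn.Module):
    """Decodes a caption while attending to video semantics conditioned on predicate cues."""

    def __init__(self, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.predicate_embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feats, caption_tokens, predicate_tokens):
        # Inject pooled predicate information into the visual memory before decoding
        # (one simple way to "guide" the decoder with action cues).
        memory = video_feats + self.predicate_embed(predicate_tokens).mean(dim=1, keepdim=True)
        hidden = self.decoder(self.word_embed(caption_tokens), memory)
        return self.out(hidden)  # (B, L, vocab_size) word logits
```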
