Abstract

Video captioning aims to describe the target objects, their attributes, and their interactions in a video using natural language. Existing methods can generate descriptions that broadly match the video content, but they struggle to capture the finer interactions and actions between objects, which are typically expressed by the predicate of the description sentence. Unlike other parts of the sentence, the predicate depends not only on the dynamic action but also on the static scene, so the relationship between its visual semantics and its textual semantics is difficult to align. In this work, we propose an Augmented Semantic Alignment (ASA) model for video captioning that explicitly learns finer actions by strengthening the semantic alignment between video and text. Specifically, we first introduce a multimodal feature aggregation network to capture high-quality, action-relevant video semantic features. We then employ an action-guided decoder to fuse these video semantic features with predicate information representing actions, yielding finer action descriptions. Evaluated on two public datasets, the proposed model generates descriptions of finer actions that align better semantically with the dynamic content of the videos.
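The abstract describes a two-stage design: a multimodal feature aggregation network followed by an action-guided decoder that fuses video semantics with predicate information. The sketch below is a minimal, hypothetical illustration of that structure, not the authors' implementation; all module names, feature dimensions, and the specific fusion strategy (transformer encoder/decoder layers, additive predicate injection) are assumptions for illustration only.

```python
# Hypothetical sketch of the two-stage design described in the abstract:
# (1) a multimodal feature aggregation module, (2) an action-guided decoder
# that fuses video features with predicate (action) embeddings.
# Dimensions, layer choices, and fusion strategy are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalFeatureAggregator(nn.Module):
    """Aggregates appearance and motion features into action-relevant video semantics."""

    def __init__(self, appearance_dim=2048, motion_dim=1024, hidden_dim=512):
        super().__init__()
        self.appearance_proj = nn.Linear(appearance_dim, hidden_dim)
        self.motion_proj = nn.Linear(motion_dim, hidden_dim)
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, appearance_feats, motion_feats):
        # appearance_feats: (B, T, appearance_dim), motion_feats: (B, T, motion_dim)
        fused = self.appearance_proj(appearance_feats) + self.motion_proj(motion_feats)
        return self.fuse(fused)  # (B, T, hidden_dim) video semantic features


class ActionGuidedDecoder(nn.Module):
    """Decodes a caption while attending to video semantics conditioned on predicate cues."""

    def __init__(self, vocab_size=10000, hidden_dim=512):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, hidden_dim)
        self.predicate_embed = nn.Embedding(vocab_size, hidden_dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feats, caption_tokens, predicate_tokens):
        # Inject pooled predicate information into the visual memory before decoding
        # (one simple way to "guide" the decoder with action cues).
        memory = video_feats + self.predicate_embed(predicate_tokens).mean(dim=1, keepdim=True)
        hidden = self.decoder(self.word_embed(caption_tokens), memory)
        return self.out(hidden)  # (B, L, vocab_size) word logits
```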
