Abstract
Video captioning aims to describe the target objects, object attributes, and object interactions in a video using natural language. Existing methods can generate descriptions that generally match the video content, but they fall short at describing the finer interactions/actions between objects, which are typically represented by the predicate in the description sentence. Unlike other parts of the sentence, the predicate depends not only on the dynamic action but also on the static scene, so the relationship between its visual semantics and its text semantics is difficult to align. In this work, we propose an Augmented Semantic Alignment (ASA) model for video captioning that explicitly learns finer actions by strengthening the semantic alignment between video and text. Specifically, we first introduce a multimodal feature aggregation network to capture high-quality, action-relevant video semantic features. We then use an action-guided decoder to fuse the video semantic features with predicate information representing actions, yielding finer action descriptions. Validated on two public datasets, the proposed model generates descriptions of finer actions that exhibit better semantic alignment with the dynamic content of videos.
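The abstract does not specify the internals of the two components, so the following is only a minimal, hypothetical sketch of how a "multimodal feature aggregation network" and an "action-guided decoder" could be wired together in PyTorch. All module names, dimensions, the cross-modal attention choice, and the LSTM decoder are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MultimodalFeatureAggregator(nn.Module):
    """Hypothetical aggregator: fuses appearance and motion features into
    action-relevant video semantics via cross-modal attention (assumed design)."""

    def __init__(self, app_dim: int, mot_dim: int, hidden_dim: int):
        super().__init__()
        self.app_proj = nn.Linear(app_dim, hidden_dim)
        self.mot_proj = nn.Linear(mot_dim, hidden_dim)
        # Motion features query appearance features, biasing the fused
        # representation toward action-relevant content.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)

    def forward(self, app_feats: torch.Tensor, mot_feats: torch.Tensor) -> torch.Tensor:
        # app_feats: (B, T, app_dim), mot_feats: (B, T, mot_dim)
        a = self.app_proj(app_feats)
        m = self.mot_proj(mot_feats)
        fused, _ = self.cross_attn(query=m, key=a, value=a)
        return fused + m  # residual keeps the motion (action) cues prominent


class ActionGuidedDecoder(nn.Module):
    """Hypothetical decoder: an LSTM whose input at every step is conditioned
    on a predicate (action) embedding in addition to the video context."""

    def __init__(self, vocab_size: int, hidden_dim: int, embed_dim: int):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.predicate_proj = nn.Linear(hidden_dim, embed_dim)
        self.lstm = nn.LSTMCell(embed_dim * 2 + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, video_feats, predicate_feat, captions):
        # video_feats: (B, T, H); predicate_feat: (B, H); captions: (B, L) token ids
        B, L = captions.shape
        ctx = video_feats.mean(dim=1)               # global video context
        pred = self.predicate_proj(predicate_feat)  # action guidance signal
        h = torch.zeros(B, ctx.size(1), device=ctx.device)
        c = torch.zeros_like(h)
        logits = []
        for t in range(L):
            w = self.word_embed(captions[:, t])
            h, c = self.lstm(torch.cat([w, pred, ctx], dim=-1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, L, vocab_size)
```

Under these assumptions, a caption word is predicted from the word history, the aggregated video context, and the predicate embedding at every decoding step, which is one plausible way to realize the "fuse video semantic features and predicate information" step described above.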