Abstract

Many existing video captioning methods capture action information in a video by exploiting features extracted from an action recognition model. However, directly using these action features without an object-specific representation may not capture object interactions well. Consequently, the generated captions may not describe the actions and objects in the scene accurately. To address this issue, we propose to incorporate the action features as edge features in a graph neural network whose nodes represent objects, thereby capturing a finer visual representation of object-action-object relationships. Previous graph-based video captioning methods commonly relied on a pretrained object detection model to create the node representations; an object detector, however, may miss some important objects. To alleviate this problem, we further introduce a grid-based node representation in which the nodes are represented by features extracted from grids of the video frames, so that the important objects in the scene are captured more thoroughly. To avoid adding any complexity during inference, the knowledge of the proposed graph is transferred to another neural network via knowledge distillation. Our proposed method achieves state-of-the-art results on all metrics on two popular video captioning datasets, MSVD and MSR-VTT. The code of our proposed method is available at https://github.com/Sejong-VLI/V2T-Action-Graph-JKSUCIS-2023.
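To make the idea above concrete, the following is a minimal sketch, assuming a PyTorch-style implementation: one message-passing layer whose nodes are grid features of a frame and whose edges carry an action feature. The class name ActionEdgeGNNLayer, the tensor shapes, and the layer sizes are illustrative assumptions, not the authors' implementation (see the linked repository for that); the knowledge-distillation step to the student captioning network is likewise omitted.

```python
# Hypothetical sketch: grid-based nodes with action features on the edges.
# Shapes and layer sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class ActionEdgeGNNLayer(nn.Module):
    """One graph layer: messages between grid nodes are conditioned on action edge features."""

    def __init__(self, node_dim: int, edge_dim: int, hidden_dim: int):
        super().__init__()
        # Combines sender node, receiver node, and action edge feature into a message.
        self.message_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden_dim),
            nn.ReLU(),
        )
        # Updates each node from its aggregated incoming messages.
        self.update_mlp = nn.Linear(node_dim + hidden_dim, node_dim)

    def forward(self, nodes: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        # nodes: (N, node_dim) grid-cell features of one frame
        # edges: (N, N, edge_dim) action feature attached to every node pair
        n = nodes.size(0)
        senders = nodes.unsqueeze(0).expand(n, n, -1)    # node j as sender
        receivers = nodes.unsqueeze(1).expand(n, n, -1)  # node i as receiver
        messages = self.message_mlp(torch.cat([receivers, senders, edges], dim=-1))
        aggregated = messages.mean(dim=1)                # average incoming messages per receiver
        return self.update_mlp(torch.cat([nodes, aggregated], dim=-1))


if __name__ == "__main__":
    # Example: a 7x7 grid gives 49 nodes; here one clip-level action feature is shared by all edges.
    grid_nodes = torch.randn(49, 512)
    action_feat = torch.randn(1024).expand(49, 49, -1)
    layer = ActionEdgeGNNLayer(node_dim=512, edge_dim=1024, hidden_dim=512)
    refined_nodes = layer(grid_nodes, action_feat)       # (49, 512) refined node features
    print(refined_nodes.shape)
```

In this sketch, the refined node features would be distilled into a separate captioning network at training time, so the graph computation is not needed during inference.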
