Abstract

Video captioning can be enhanced by incorporating external knowledge, which is usually represented as relationships between objects. However, previous methods construct only superficial or static object relationships, and often introduce noise through irrelevant commonsense knowledge or fixed syntactic templates. These problems degrade model interpretability and lead to undesirable results. To overcome these limitations, we propose to enhance video captioning with deep-level object relationships that are adaptively explored during training. Specifically, we present a Transitive Visual Relationship Detection (TVRD) module that estimates the actions of the visual objects and constructs an Object-Action Graph (OAG) describing the shallow relationships between objects and actions. We then bridge objects via their shared actions to transitively infer an Object-Object Graph (OOG) that reflects deep-level relationships. We further feed the OOG to a graph convolutional network to refine the object representations according to these deep-level relationships. With the refined representations, we employ an LSTM-based decoder for caption generation. Experimental results on two benchmark datasets, MSVD and MSR-VTT, demonstrate that the proposed method achieves state-of-the-art performance. Finally, we present comprehensive ablation studies as well as visualizations of the visual relationships to demonstrate the effectiveness and interpretability of our model.
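
The following is a minimal sketch (in PyTorch) of the transitive inference and graph-convolutional refinement described above. The class name, tensor shapes, and the row-normalized adjacency with self-loops are illustrative assumptions, not the paper's exact formulation; it only shows how an object-object graph can be derived from an object-action graph and used to refine object features.

```python
# Minimal sketch of transitive OAG -> OOG inference and one GCN refinement step.
# All names, shapes, and the normalization scheme are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitiveRelationSketch(nn.Module):
    def __init__(self, obj_dim, hidden_dim):
        super().__init__()
        # Single GCN layer weight used to refine object representations.
        self.gcn_weight = nn.Linear(obj_dim, hidden_dim, bias=False)

    def forward(self, obj_feats, oag):
        # obj_feats: (num_objects, obj_dim) visual object features
        # oag: (num_objects, num_actions) object-action affinities (shallow relationships)
        # Transitive step: two objects are related if they attach to the same action,
        # so the object-object graph is obtained by composing the OAG with its transpose.
        oog = torch.matmul(oag, oag.t())            # (num_objects, num_objects)
        oog = oog + torch.eye(oog.size(0))          # add self-loops
        deg = oog.sum(dim=1, keepdim=True).clamp(min=1e-6)
        oog_norm = oog / deg                        # row-normalized adjacency
        # GCN refinement: propagate object features over the deep-level relationships.
        refined = F.relu(self.gcn_weight(torch.matmul(oog_norm, obj_feats)))
        return refined, oog_norm

# Example usage with random features for 5 detected objects and 3 candidate actions.
objs = torch.randn(5, 512)
oag = torch.rand(5, 3)
refined, oog = TransitiveRelationSketch(512, 512)(objs, oag)
print(refined.shape, oog.shape)  # torch.Size([5, 512]) torch.Size([5, 5])
```

The refined object representations would then be passed, together with the usual visual features, to an LSTM-based decoder for caption generation.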
