Abstract

Cross-modal text-to-video retrieval aims to find relevant videos for given text queries, which is crucial for various real-world applications. The key to addressing this task is to build correspondences between video and text such that related samples from different modalities can be aligned. As a text query (sentence) contains both nouns and verbs, representing objects as well as their interactions, retrieving the relevant videos requires a fine-grained understanding of video content: not only the semantic concepts (i.e., objects) but also the interactions between them. Nevertheless, current approaches mostly represent videos with aggregated frame-level features when learning the joint space and ignore object interactions, which usually results in suboptimal retrieval performance. To improve cross-modal video retrieval, this paper proposes a framework that models videos as spatial-temporal graphs, in which nodes correspond to visual objects and edges to the relations/interactions between them. These graphs capture object interactions across frame sequences and thereby enrich the video representations used for joint-space learning. Specifically, a Graph Convolutional Network (GCN) is introduced to learn representations on the spatial-temporal graphs, encoding the spatial-temporal interactions between objects, while BERT is introduced to encode the sentence dynamically according to its context for cross-modal retrieval. Extensive experiments verify the effectiveness of the proposed framework, which achieves promising performance on both the MSR-VTT and LSMDC datasets.
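To make the described pipeline concrete, below is a minimal PyTorch sketch of the two encoders the abstract outlines: a GCN over a spatial-temporal object graph on the video side and a BERT-based sentence encoder on the text side, both projecting into a shared joint space. This is an illustrative reconstruction, not the authors' code; all names (`GCNLayer`, `VideoGraphEncoder`, `TextEncoder`), the dimensions, the mean-pooling readout, and the use of HuggingFace's `bert-base-uncased` checkpoint are assumptions for the sketch.

```python
# Illustrative sketch only -- NOT the paper's implementation.
# Assumed inputs: per-object features from a pretrained detector, and an
# adjacency matrix encoding spatial/temporal relations between objects.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer


def normalize_adjacency(a: torch.Tensor) -> torch.Tensor:
    # Row-normalize A + I (self-loops), a common GCN preprocessing step.
    a = a + torch.eye(a.size(0))
    return a / a.sum(dim=1, keepdim=True)


class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A_hat H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, a_hat):
        # h: (num_nodes, in_dim); a_hat: normalized adjacency (num_nodes, num_nodes)
        return F.relu(self.linear(a_hat @ h))


class VideoGraphEncoder(nn.Module):
    """Stacks GCN layers over the spatial-temporal object graph, then
    mean-pools node features into a single normalized video embedding."""
    def __init__(self, obj_dim=2048, hid_dim=512, joint_dim=256):
        super().__init__()
        self.gcn1 = GCNLayer(obj_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, joint_dim)

    def forward(self, obj_feats, a_hat):
        h = self.gcn2(self.gcn1(obj_feats, a_hat), a_hat)
        return F.normalize(h.mean(dim=0), dim=-1)


class TextEncoder(nn.Module):
    """Context-dependent sentence embedding from BERT's [CLS] token,
    projected into the same joint space as the video embedding."""
    def __init__(self, joint_dim=256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, joint_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        return F.normalize(self.proj(out.last_hidden_state[:, 0]), dim=-1)


# Example: score one sentence against one video graph (toy data).
video_enc, text_enc = VideoGraphEncoder(), TextEncoder()
tok = BertTokenizer.from_pretrained("bert-base-uncased")
obj_feats = torch.randn(12, 2048)  # 12 detected objects across frames
a_hat = normalize_adjacency(torch.randint(0, 2, (12, 12)).float())
inputs = tok("a man is playing guitar", return_tensors="pt")
v = video_enc(obj_feats, a_hat)
t = text_enc(inputs["input_ids"], inputs["attention_mask"])
similarity = (v * t.squeeze(0)).sum()  # cosine similarity in the joint space
```

In a retrieval setting, such embeddings would typically be trained with a ranking objective (e.g., a triplet or contrastive loss) so that matching video-sentence pairs score higher than mismatched ones; at test time, videos are ranked by cosine similarity to the query embedding.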
