Abstract

In this paper, we target the challenging task of video-text retrieval. The common approach is to learn a text-video joint embedding space via cross-modal representation learning and to compute cross-modal similarity in that space. Since videos typically contain rich information, how to represent them in the joint embedding space is crucial for video-text retrieval. Most existing works rely on pre-extracted frame-level or clip-level features for video representation, which may cause fine-grained object information in videos to be ignored. To alleviate this, we explicitly introduce more fine-grained object-level features to enrich the video representation. To exploit the potential of object-level features, we propose a new model named FeatInter, which jointly considers the visual and semantic features of objects. In addition, a visual-semantic interaction and a cross-feature interaction are proposed to mutually enhance object features and frame features. Extensive experiments on two challenging video datasets, i.e., MSR-VTT and TGIF, demonstrate the effectiveness of our proposed model. Moreover, our model achieves a new state of the art on TGIF. While state-of-the-art methods use seven video features on MSR-VTT, our model obtains comparable performance with only three features.
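To make the general setup concrete, the sketch below illustrates the common joint-embedding paradigm described above: each modality is projected into a shared space and text-video pairs are scored by cosine similarity. The layer choices, feature dimensions, and class name are illustrative assumptions for exposition only, not the FeatInter architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Generic text-video joint embedding (illustrative, not FeatInter):
    project each modality into a shared space and score pairs by
    cosine similarity."""
    def __init__(self, video_dim=2048, text_dim=768, joint_dim=512):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)  # video branch
        self.text_proj = nn.Linear(text_dim, joint_dim)    # text branch

    def forward(self, video_feat, text_feat):
        # L2-normalize so the dot product equals cosine similarity.
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return v @ t.t()  # (num_videos, num_texts) similarity matrix

# Toy usage: 4 videos and 6 captions with random pre-extracted features.
model = JointEmbedding()
sims = model(torch.randn(4, 2048), torch.randn(6, 768))
print(sims.shape)  # torch.Size([4, 6])
```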
