Abstract

Cross-modal retrieval between texts and videos is important yet challenging. Previous works in this domain typically rely on learning a common space in which to match text and video, but matching is difficult because of the semantic gap between the two modalities. Although some methods employ coarse-to-fine or multi-expert networks to encode one or more common spaces for easier matching, they still optimize a single matching space directly, which remains challenging given the large semantic gap between modalities. To address this issue, we aim to narrow the semantic gap through a progressive learning process with a coarse-to-fine architecture, and propose a novel Progressive Semantic Matching (PSM) method. We first construct a multilevel encoding network for videos and texts and design several auxiliary common spaces, each mapped from the outputs of encoders at a different level. All common spaces are then jointly trained end to end. In this way, the model can effectively encode videos and texts into a fused common space in a progressive manner. Experimental results on three video-text datasets (i.e., MSR-VTT, TGIF, and MSVD) demonstrate the advantages of PSM, which achieves significant performance improvements over state-of-the-art approaches.
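The following is a minimal sketch of the progressive idea described above: level-specific encoders map video and text features into several common spaces, and a ranking loss is summed over all of them so the auxiliary spaces are trained jointly with the final one. All module names, dimensions, and the specific triplet-style loss are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LevelEncoder(nn.Module):
    """Encodes a sequence of frame/word features into one level-specific vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        return self.proj(x.mean(dim=1))        # mean-pool over the sequence, then project

class ProgressiveMatcher(nn.Module):
    """Maps videos and texts into several common spaces, one per encoding level."""
    def __init__(self, video_dim=2048, text_dim=300, space_dims=(256, 512, 1024)):
        super().__init__()
        self.video_encoders = nn.ModuleList(LevelEncoder(video_dim, d) for d in space_dims)
        self.text_encoders = nn.ModuleList(LevelEncoder(text_dim, d) for d in space_dims)

    def forward(self, video_feats, text_feats):
        # Returns a list of (video_emb, text_emb) pairs, one per common space.
        return [(ve(video_feats), te(text_feats))
                for ve, te in zip(self.video_encoders, self.text_encoders)]

def progressive_loss(pairs, margin=0.2):
    """Sums a triplet-style ranking loss over all auxiliary and final common spaces."""
    total = 0.0
    for v, t in pairs:
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        sim = v @ t.t()                        # (batch, batch) cosine similarities
        pos = sim.diag().unsqueeze(1)          # matched video-text pairs
        loss = (margin + sim - pos).clamp(min=0)
        mask = torch.eye(sim.size(0), dtype=torch.bool)
        loss = loss.masked_fill(mask, 0.0)     # ignore the positive pair itself
        total = total + loss.mean()
    return total

# Joint end-to-end training over all common spaces (random features for illustration).
model = ProgressiveMatcher()
videos = torch.randn(8, 20, 2048)              # 8 clips, 20 frames of CNN features
texts = torch.randn(8, 12, 300)                # 8 captions, 12 word embeddings
loss = progressive_loss(model(videos, texts))
loss.backward()
```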
