Abstract

The goal of cross-modal moment localization is to find the temporal moment in an untrimmed video that semantically corresponds to a natural language query. Most current approaches learn cross-modal moment localization models from fine-grained temporal annotations of the video, which are extremely time-consuming and labor-intensive to obtain. In this paper, we propose a novel framework for weakly supervised cross-modal moment localization that incorporates a proposal generation module and a semantic reconstruction module. The proposal generation module models cross-modal video representations on a two-dimensional temporal feature map, which encodes the moment-wise temporal relationships among moment candidates. Based on the generated proposals, the semantic reconstruction module assesses each proposal's capacity to restore the text query, which provides weak supervision for network training. In addition, a punishment loss is proposed to further eliminate the effect of the invalid area of the temporal map. Extensive experimental results show that the proposed method achieves state-of-the-art performance, demonstrating its effectiveness for weakly supervised moment localization with natural language.

Highlights

• A new framework is proposed for weakly supervised cross-modal moment localization.
• A multi-task loss is designed for the weakly supervised optimization of the network.
• The proposal generation module exploits moment-wise temporal relationships.
• Extensive experimental results demonstrate the effectiveness of our method.
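To make the two-dimensional temporal feature map concrete, the sketch below builds one from per-clip features: cell (i, j) holds the pooled feature of the candidate moment spanning clips i through j, and only the upper triangle (start ≤ end) is valid. This is a minimal illustration under our own assumptions (PyTorch, mean pooling, the shown shapes), not the authors' implementation.

```python
import torch

def build_2d_moment_map(clip_feats: torch.Tensor):
    """clip_feats: (N, D) per-clip features -> (N, N, D) moment map and (N, N) validity mask."""
    n, d = clip_feats.shape
    # Prefix sums give O(1) mean pooling over any clip span i..j.
    prefix = torch.cat([torch.zeros(1, d), clip_feats.cumsum(dim=0)], dim=0)
    moment_map = torch.zeros(n, n, d)
    valid_mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(i, n):
            # Cell (i, j): mean feature of the candidate moment covering clips i..j.
            moment_map[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
            valid_mask[i, j] = True  # upper triangle only: start <= end
    return moment_map, valid_mask

clip_feats = torch.randn(16, 512)          # e.g. 16 clips with 512-d features
moment_map, valid_mask = build_2d_moment_map(clip_feats)
print(moment_map.shape, valid_mask.sum())  # torch.Size([16, 16, 512]), 136 valid cells
```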
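The abstract does not spell out the punishment loss, but one plausible reading is a term that drives predicted proposal scores toward zero on the invalid (lower-triangle) cells of the map. The function below is a hypothetical sketch of such a term; the name, the squared-score form, and the mask convention are all assumptions, not the paper's definition.

```python
import torch

def punishment_loss(score_map: torch.Tensor, valid_mask: torch.Tensor) -> torch.Tensor:
    """score_map: (N, N) proposal scores in [0, 1]; valid_mask: (N, N) bool.

    Penalizes any score mass on invalid cells (end index before start index),
    suppressing the effect of the invalid area during training.
    """
    invalid = ~valid_mask
    if not invalid.any():
        return score_map.new_zeros(())
    return score_map[invalid].pow(2).mean()

# Toy usage: a 4x4 map where only the upper triangle (start <= end) is valid.
scores = torch.rand(4, 4)
mask = torch.triu(torch.ones(4, 4, dtype=torch.bool))
print(punishment_loss(scores, mask))  # mean squared score over the 6 invalid cells
```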
