Abstract

In this paper, we tackle the task of natural language video localization (NLVL): given an untrimmed video and a natural language query, the goal is to localize the temporal segment within the video that best matches the query. NLVL is challenging because it lies at the intersection of language and video understanding: a video may contain multiple segments of interest, and the language may describe complicated temporal dependencies. Though existing approaches have achieved good performance, most of them do not fully consider the inherent differences between the language and video modalities. Here, we propose the Moment Relation Network (MRN) to reduce the divergence between the probability distributions of these two modalities. Specifically, MRN trains video and language subnets and then uses transfer learning techniques to map the extracted features into a shared embedding space, where we compute the similarity of the two modalities with a Mahalanobis distance metric that is used to localize moments. Extensive experiments on benchmark datasets show that the proposed MRN outperforms the state-of-the-art by a large margin under widely used metrics.
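To make the cross-modal scoring step concrete, below is a minimal sketch of how features from a video subnet and a language subnet could be projected into a shared embedding space and compared with a learned Mahalanobis metric. All names, dimensions, and the M = L^T L parameterization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_video, d_lang, d_embed = 1024, 300, 256  # hypothetical feature dimensions

# Stand-ins for the learned projection weights of the two subnets.
W_video = rng.standard_normal((d_video, d_embed)) * 0.01
W_lang = rng.standard_normal((d_lang, d_embed)) * 0.01

# Parameterize the Mahalanobis metric as M = L^T L so that M stays
# positive semi-definite while L is learned.
L = rng.standard_normal((d_embed, d_embed)) * 0.01

def mahalanobis_distance(x, y, L):
    """Distance sqrt((x - y)^T L^T L (x - y)) under the learned metric."""
    diff = L @ (x - y)
    return np.sqrt(diff @ diff)

# Placeholder features for one candidate moment and one query.
video_feat = rng.standard_normal(d_video)
query_feat = rng.standard_normal(d_lang)

v = video_feat @ W_video  # video embedding in the shared space
q = query_feat @ W_lang   # language embedding in the shared space

# Smaller distance means higher similarity; the candidate moment whose
# embedding is closest to the query embedding would be localized.
score = -mahalanobis_distance(v, q, L)
print(score)
```

In practice, this scoring would be applied over all candidate moments in the video, with the highest-scoring segment returned as the localization result.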
