Abstract
Content-based video moment retrieval (CVMR) aims to localize a contiguous sequence of frames in an untrimmed reference video, called the target moment, that semantically corresponds to a given query video. Current state-of-the-art CVMR methods are mainly developed using frame-level annotations, which are often expensive to collect. In this paper, we develop a weakly-supervised CVMR method that uses coarse-grained video-level annotations during training. Under weak supervision, video localizers require more discriminative frame-level video features. To this end, we propose a novel prior, termed the low-rank prior, based on the observation that the frame-level features of a video should have low-rank properties. We demonstrate that low-rank features are more discriminative and help localize action boundaries accurately. To produce low-rank features, we design a low-rank feature reconstruction (LFR) operator. We propose a new differentiable matrix decomposition approach that generates a low-rank reconstruction of the input matrix while keeping the decomposition process differentiable. Based on LFR, we develop a new weakly-supervised CVMR model that produces low-rank video representations and measures semantic consistency to discover the segment in the reference video that semantically matches the query video. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art weakly-supervised methods and even achieves performance competitive with fully-supervised baselines.
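To make the idea of a low-rank reconstruction of frame-level features concrete, the following is a minimal sketch that uses a plain truncated SVD as a stand-in. It is not the LFR operator or the differentiable decomposition described above, whose details are not given in this abstract. The function name `low_rank_reconstruct`, the feature shapes, and the noise level are all illustrative assumptions.

```python
import numpy as np


def low_rank_reconstruct(features: np.ndarray, rank: int) -> np.ndarray:
    """Return the best rank-`rank` approximation of `features` (Frobenius norm).

    Uses a truncated SVD. This is a standard substitute chosen for
    illustration, NOT the paper's differentiable decomposition.
    """
    u, s, vt = np.linalg.svd(features, full_matrices=False)
    # Keep only the top `rank` singular triples.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


rng = np.random.default_rng(0)

# Hypothetical frame features: 40 frames x 16 dimensions, generated from
# 3 latent patterns so that the clean matrix has rank 3.
clean = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 16))
noisy = clean + 0.05 * rng.normal(size=(40, 16))

recon = low_rank_reconstruct(noisy, rank=3)

print("rank of reconstruction:", np.linalg.matrix_rank(recon))
print("error before:", np.linalg.norm(noisy - clean))
print("error after: ", np.linalg.norm(recon - clean))
```

In this toy setting the reconstruction discards the feature directions that carry mostly noise, which is one intuition for why low-rank features could be more discriminative. A trainable model would need a decomposition through which gradients can flow, which is the role the abstract assigns to the proposed differentiable approach.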