Abstract
Video moment retrieval aims to retrieve, from an untrimmed video, the target moment that semantically corresponds to a given language query. Existing methods commonly treat it as a regression or ranking task from a computer vision perspective. Most of these works neglect the comprehensive, multi-granularity relations between video content and language context, and fail to efficiently model temporal relations among different video moments. In this paper, we formulate video moment retrieval as video reading comprehension by treating the input video as a text passage and the language query as a question. To address these limitations, we propose a Comprehensive Relation-aware Network (CRNet) that perceives comprehensive relations from multiple aspects. Specifically, we fuse visual and textual features at both the clip level and the moment level to thoroughly exploit inter-modality information, yielding a coarse-and-fine cross-modal interaction. Moreover, a background suppression module is introduced to suppress irrelevant background clips, while a novel IoU attention mechanism and a graph attention layer are devised to focus on the dependencies among highly correlated video moments when selecting the best candidate. Extensive experiments on three public datasets, TACoS, ActivityNet Captions, and Charades-STA, demonstrate the superiority of our solution.
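For readers unfamiliar with the quantity that the IoU attention mechanism is named after, the following minimal sketch computes the temporal intersection-over-union (IoU) between candidate moments. It illustrates only the standard overlap measure, not the CRNet attention module itself; the function names `temporal_iou` and `pairwise_iou` and the `(start, end)` representation in seconds are assumptions made for illustration.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # hypothetical (start, end) representation, in seconds


def temporal_iou(a: Moment, b: Moment) -> float:
    """Temporal IoU of two moments: overlap length divided by union length."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def pairwise_iou(moments: List[Moment]) -> List[List[float]]:
    """Symmetric matrix of temporal IoU values between all candidate moments."""
    return [[temporal_iou(m, n) for n in moments] for m in moments]
```

A matrix of this kind indicates which candidate moments overlap strongly in time; how such overlaps are turned into attention weights is specific to the model and is not shown here.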