Abstract

Given an untrimmed video and a sentence query, video moment retrieval aims to locate the target video moment that semantically corresponds to the query. It is a challenging task that requires joint understanding of the natural language query and the video content. However, a video contains complex content, both query-related and query-irrelevant, which makes this joint understanding difficult. To this end, we propose a query-aware video encoder that captures the query-related visual content. Specifically, we design a query-guided block following each encoder layer to recalibrate the encoded visual features according to the query semantics. The core of the query-guided block is a channel-level attention gating mechanism, which selectively emphasizes query-related visual content and suppresses query-irrelevant content. In addition, to match the different levels of content in videos, we learn hierarchical and structural query clues to guide visual content capture. We disentangle the sentence query into a semantics graph and capture the local contexts inside the graph via a trilinear model as query clues. Extensive experiments on the Charades-STA and TACoS datasets demonstrate the effectiveness of our approach, which achieves state-of-the-art results on both datasets.
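
As a rough illustration of channel-level attention gating, the sketch below computes one gate per visual channel from a query vector and rescales every clip feature by those gates. It is a minimal sketch under stated assumptions, not the block used in the paper: the function name `query_guided_gate`, the single linear projection (`weight`, `bias`), and the sigmoid gate are all hypothetical choices made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def query_guided_gate(visual, query, weight, bias):
    """Rescale visual channels with query-dependent gates (illustrative only).

    visual: list of T clip features, each a list of C channel values
    query:  list of D values summarizing the sentence query
    weight: C x D projection (hypothetical learned parameter)
    bias:   list of C offsets (hypothetical learned parameter)
    """
    # One gate in (0, 1) per visual channel, computed from the query alone.
    gates = [
        sigmoid(sum(w * q for w, q in zip(row, query)) + b)
        for row, b in zip(weight, bias)
    ]
    # Apply the same gates to every clip: channels with gates near 1 are kept,
    # channels with gates near 0 are suppressed.
    gated = [[v * g for v, g in zip(clip, gates)] for clip in visual]
    return gated, gates


if __name__ == "__main__":
    visual = [[1.0, 2.0], [3.0, 4.0]]   # T = 2 clips, C = 2 channels
    query = [1.0, 0.0]                  # D = 2
    weight = [[10.0, 0.0], [-10.0, 0.0]]
    bias = [0.0, 0.0]
    gated, gates = query_guided_gate(visual, query, weight, bias)
    print(gates)
    print(gated)
```

With these toy parameters the first channel's gate is close to 1 and the second is close to 0, so the second channel is almost entirely suppressed in every clip.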

