Understanding what is happening in surveillance video is important for human-machine interfaces in transportation systems. Temporal language grounding is one of the key tasks in this setting: it aims to localize the desired moment in an untrimmed video given a sentence query that describes that moment. The task is challenging for two reasons: 1) it requires comprehensive understanding of both the video content and the query semantics, and 2) it requires bridging the semantics of the two modalities. To tackle these problems, early methods first sample video clips and then match them with the sentence to find the most relevant one. To reduce the computational cost of video clip sampling, recent methods directly predict the temporal boundaries of the desired moment from the fused features of the sentence and the video frames. However, previous methods typically learn only word-level or phrase-level features of the sentence, or directly generate a global sentence representation with attention mechanisms or graph networks. We argue that word-level or phrase-level semantic information and cross-modal interactions alone are not enough to fully capture the correspondence between the video and the query. To this end, we propose a novel Multi-level Query Exploration and Interaction (MQEI) model, which explores the query semantics at both the word and phrase levels and captures the multi-level interactions between the video and the query through an attention module. Extensive experiments on two public benchmark datasets, ActivityNet Captions and Charades-STA, demonstrate that the proposed model consistently outperforms the state-of-the-art methods.