Multi-Level Query Interaction for Temporal Language Grounding

Haoyu Tang, Jihua Zhu, Qinghai Zheng, Tianwei Zhang, Lin Wang

doi:10.1109/tits.2021.3110713

Abstract

Understanding what is happening in surveillance video is important for human-machine interfaces in transportation systems. Temporal language grounding is one of the key tasks in this setting: it aims to localize the desired moment in an untrimmed video given a sentence query that describes the moment. The task is challenging for two reasons: 1) it requires a comprehensive understanding of both the video content and the query semantics, and 2) it requires bridging the cross-modal semantics. To tackle these problems, early methods first sample video clips and then match them against the sentence to find the most relevant one. To reduce the computational cost of clip sampling, recent methods directly predict the temporal boundaries of the desired moment from the fused features of the sentence and the video frames. However, previous methods typically learn either word-level or phrase-level features of the sentence, or directly generate a global sentence representation through attention mechanisms or graph networks. We argue that applying only word-level or only phrase-level semantic information and cross-modal interactions is not enough to fully capture the correspondence between the video and the query. To this end, we propose a novel Multi-level Query Exploration and Interaction (MQEI) model, which explores the semantics at both the word and phrase levels and captures the multi-level interactions between the video and the query through an attention module. Extensive experiments on two public benchmark datasets, ActivityNet Captions and Charades-STA, demonstrate that the proposed model consistently outperforms all the state-of-the-art methods.
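
To make the idea of multi-level query interaction concrete, the following is a minimal sketch and not the MQEI architecture itself, whose details the abstract does not give. It assumes that phrase-level features can be approximated by averaging sliding windows of word vectors and that the attention module is plain dot-product attention. The function names (`phrase_features`, `attend`, `fuse`) and the `window` parameter are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def phrase_features(words, window=2):
    """Assumed phrase-level features: mean of each sliding window of word vectors."""
    dim = len(words[0])
    return [
        [sum(w[d] for w in words[i:i + window]) / window for d in range(dim)]
        for i in range(len(words) - window + 1)
    ]


def attend(frame, query_feats):
    """Attention-weighted sum of query features, scored against one frame vector."""
    weights = softmax([dot(frame, q) for q in query_feats])
    dim = len(frame)
    return [sum(w * q[d] for w, q in zip(weights, query_feats)) for d in range(dim)]


def fuse(frames, words, window=2):
    """Concatenate each frame with its word-level and phrase-level query contexts."""
    phrases = phrase_features(words, window)
    return [f + attend(f, words) + attend(f, phrases) for f in frames]


# Toy inputs: two 2-D frame features and three 2-D word features.
frames = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse(frames, words)
```

In this sketch each fused frame vector has three parts: the frame feature, a word-level context, and a phrase-level context. A boundary predictor, omitted here, would operate on these fused features.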

Journal: IEEE Transactions on Intelligent Transportation Systems
Publication Date: Dec 1, 2022
Citations: 8

Similar Papers

Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.
Zongmeng Zhang ... Yan Yan
IEEE Transactions on Image Processing | VOL. 30
01 Jan 2020

Cross-Modal Interaction Network for Video Moment Retrieval
Shen Ping ... Ronghui Cao
International Journal of Pattern Recognition and Artificial Intelligence | VOL. 37
30 Jun 2023

Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization
Daizong Liu ... Xiaoye Qu
12 Oct 2020

Frame-Wise Cross-Modal Matching for Video Moment Retrieval
Haoyu Tang ... Meng Liu
IEEE Transactions on Multimedia | VOL. 24
29 Oct 2020
