Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos.

Zongmeng Zhang, Yan Yan, Xuemeng Song, Xianjing Han, Liqiang Nie

doi:10.1109/tip.2021.3113791

Abstract

This paper addresses temporal language localization in videos, which aims to identify the start and end points of the moment described by a natural language sentence in an untrimmed video. The task is non-trivial because it requires not only a comprehensive understanding of the video and the sentence query but also accurate capture of the semantic correspondence between them. Existing efforts mainly explore the sequential relation among video clips and query words to reason about the video and the sentence query, neglecting other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among query words). To this end, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly explores the complex intra-modal relations and inter-modal interactions residing in the video and the sentence query to facilitate both their understanding and the capture of their semantic correspondence. In addition, we devise an adaptive context-aware localization method, in which context information is incorporated into the candidate moments and multi-scale fully connected layers are designed to rank the generated coarse candidate moments of different lengths and adjust their boundaries. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
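
The localization step described in the abstract (coarse candidate moments of different lengths, context added to each candidate, then boundary adjustment) can be illustrated with a minimal sketch. This is not the authors' implementation: the sliding-window scheme, the function names, the stride and context ratios, and the use of fixed offsets in place of learned ones are all assumptions made for illustration. Temporal IoU is included as a common way to compare a predicted moment with an annotated one; the abstract itself does not name a metric.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def sliding_candidates(duration: float, lengths: List[float],
                       stride_ratio: float = 0.5) -> List[Moment]:
    """Generate coarse candidates at several lengths (hypothetical scheme)."""
    candidates: List[Moment] = []
    for length in lengths:
        if length > duration:
            continue  # a window longer than the video cannot fit
        stride = length * stride_ratio
        start = 0.0
        while start + length <= duration + 1e-9:
            candidates.append((start, start + length))
            start += stride
    return candidates


def with_context(moment: Moment, duration: float, ratio: float = 0.25) -> Moment:
    """Extend a candidate on both sides by a fraction of its length (assumed ratio)."""
    start, end = moment
    pad = (end - start) * ratio
    return (max(0.0, start - pad), min(duration, end + pad))


def adjust(moment: Moment, start_offset: float, end_offset: float,
           duration: float) -> Moment:
    """Apply boundary offsets in seconds and clip the result to the video.

    In a trained model the offsets would be predicted; here they are inputs.
    """
    start = min(max(0.0, moment[0] + start_offset), duration)
    end = min(max(start, moment[1] + end_offset), duration)
    return (start, end)


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0
```

For a 10-second video, `sliding_candidates(10.0, [4.0])` yields four 4-second windows with 50% overlap; each can be widened with `with_context`, shifted with `adjust`, and compared with an annotated moment using `temporal_iou`.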

Journal: IEEE Transactions on Image Processing
Publication Date: Jan 1, 2021
Citations: 24

Similar Papers

VMLH: Efficient Video Moment Location via Hashing
Zhifang Tan ... Chenglong Li
Electronics | Vol. 12
13 Jan 2023

Proposal-free Temporal Moment Localization of a Natural-Language Query in Video using Guided Attention
Cristian Rodriguez-Opazo ... Edison Marrese-Taylor
01 Mar 2020

Hierarchical Matching and Reasoning for Action Localization via Language Query
Tianyu Li ... Xinxiao Wu
01 Jan 2020

Exploiting Auxiliary Caption for Video Grounding
Hongxiang Li ... Meng Cao
Proceedings of the AAAI Conference on Artificial Intelligence | Vol. 38
24 Mar 2024
