Abstract

Temporal language localization in videos aims to ground a video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field focus only on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, while clues from the reverse query-to-video interactions, which imply where to look, are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention, and then the video iteratively attends to the integrated query. Finally, both the video and query information are utilized to provide a robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of the recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches.
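To make the bilateral interaction concrete, the sketch below illustrates one possible reading of the iterative attention module: word features first attend over frame features (query-to-video), and the video then attends back to the enhanced query (video-to-query). This is a minimal illustrative sketch, not the authors' implementation; the module name, projection layers, and residual fusion are assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of bilateral query-video attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q2v_proj = nn.Linear(dim, dim)  # projects words before attending to frames
        self.v2q_proj = nn.Linear(dim, dim)  # projects frames before attending to words

    def forward(self, frames, words):
        # frames: (B, T, D) video frame features; words: (B, L, D) query word features.
        # Step 1: each word attends to every frame (query-to-video attention).
        q2v_scores = torch.matmul(self.q2v_proj(words), frames.transpose(1, 2))   # (B, L, T)
        attended_frames = torch.matmul(F.softmax(q2v_scores, dim=-1), frames)     # (B, L, D)
        enhanced_words = words + attended_frames  # word features enriched by visual context

        # Step 2: each frame attends to the integrated (enhanced) query (video-to-query attention).
        v2q_scores = torch.matmul(self.v2q_proj(frames), enhanced_words.transpose(1, 2))  # (B, T, L)
        attended_words = torch.matmul(F.softmax(v2q_scores, dim=-1), enhanced_words)      # (B, T, D)
        fused = frames + attended_words  # cross-modal video representation for localization
        return fused, enhanced_words
```

In the paper this interaction is applied iteratively; a single pass is shown here only to convey the direction of each attention step.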
