Abstract

Given a natural language description, temporal textual localization aims to localize the most relevant segment in an untrimmed video, a natural and important extension of temporal action localization. Most existing temporal textual localization works neglect long-range semantic modeling of video content and lack accurate textual understanding. Moreover, they rely on single-task learning and fail to exploit multi-view supervised information. Based on these observations, we introduce a novel adversarial bi-directional interaction network, a global framework that retrieves the target segment directly. Specifically, we propose a bi-directional attention mechanism that builds bi-directional information interaction between video and text, capturing long-range semantic dependencies from the video context and enhancing textual representation learning. After localization, we further devise an auxiliary discriminator network that verifies the localization result and boosts performance through an adversarial training process. We adopt a multi-task learning approach to train our model, comprising: (1) a coordinate probability distribution prediction task, which selects the start and end frames that localize the target segment; (2) a frame-level correlation distribution prediction task, which computes the correlation between each frame and the description; (3) an auxiliary adversarial learning task, which scores the match between the localized segment and the description to further boost performance. Extensive experiments on ActivityNet Captions and TACoS demonstrate the effectiveness and efficiency of our method.
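
The abstract gives no implementation details, but the three enumerated objectives can be read as a weighted multi-task loss. The sketch below, in PyTorch, illustrates that reading under stated assumptions: the function name multi_task_loss, the weights w_loc, w_corr, and w_adv, and the specific loss choices (cross-entropy over start/end position distributions, binary cross-entropy for frame-level correlation and for the discriminator's match score) are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(start_logits, end_logits,    # (B, T) boundary scores over frames
                    start_idx, end_idx,          # (B,) ground-truth frame indices
                    frame_logits, frame_labels,  # (B, T) per-frame correlation vs. 0/1 mask
                    disc_logits,                 # (B,) discriminator match score for the localized segment
                    w_loc=1.0, w_corr=1.0, w_adv=1.0):
    """Hypothetical weighted sum of the three tasks described in the abstract."""
    # (1) coordinate probability distribution task: cross-entropy over the
    #     predicted start/end position distributions
    loss_loc = (F.cross_entropy(start_logits, start_idx)
                + F.cross_entropy(end_logits, end_idx))
    # (2) frame-level correlation distribution task: per-frame binary
    #     cross-entropy against a mask marking frames inside the target segment
    loss_corr = F.binary_cross_entropy_with_logits(frame_logits, frame_labels)
    # (3) adversarial learning task: the localizer is rewarded when the
    #     auxiliary discriminator judges its segment to match the description
    loss_adv = F.binary_cross_entropy_with_logits(
        disc_logits, torch.ones_like(disc_logits))
    return w_loc * loss_loc + w_corr * loss_corr + w_adv * loss_adv
```

In this reading, the first two terms train the localizer directly, while the third back-propagates the discriminator's judgment into it, mirroring the adversarial training process the abstract describes.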
