Abstract

Weakly supervised temporal action localization (WSTAL), which aims to locate the time intervals of actions in an untrimmed video using only video-level action labels, has attracted increasing research interest in the past few years. However, a model trained with such labels tends to focus on the segments that contribute most to the video-level classification, leading to inaccurate and incomplete localization results. In this paper, we tackle the problem from the novel perspective of relation modeling and propose a method dubbed Bilateral Relation Distillation (BRD). The core of our method is to learn representations by jointly modeling relations at the category and sequence levels. Specifically, category-wise latent segment representations are first obtained by separate embedding networks, one for each category. We then distill knowledge from a pre-trained language model to capture category-level relations, which is achieved by performing correlation alignment and category-aware contrast in both an intra- and inter-video manner. To model relations among segments at the sequence level, we devise a gradient-based feature augmentation method and encourage the learned latent representation of the augmented feature to be consistent with that of the original one. Extensive experiments demonstrate that our approach achieves state-of-the-art results on the THUMOS14 and ActivityNet1.3 datasets.
