Abstract

Weakly-supervised Temporal Action Localization (WTAL) aims to localize actions in untrimmed videos using only video-level labels. Most existing methods adopt a "localization by classification" paradigm and extract features with a model pre-trained on a recognition task. The gap between the recognition and localization tasks leads to inferior performance. Some recent works apply feature enhancement to obtain better features for localization and boost performance to some extent; however, they exploit only intra-video information while ignoring meaningful inter-video information in the dataset. In this paper, we propose a novel Dual-Feature Enhancement (DFE) method for WTAL that utilizes both intra- and inter-video information. For intra-video information, a local feature enhancement module is designed to promote feature interaction along the temporal dimension within each video. For inter-video information, a global memory module is first designed to learn representations for different categories across different videos; a global feature enhancement module then enhances the video features with the help of these global representations in the memory. Moreover, to avoid the extra computational cost of the global enhancement module at inference, a distillation loss enforces the local branch to learn from the global branch, so that the global enhancement module can be removed during inference. The proposed method achieves state-of-the-art performance on popular benchmarks.
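The dual-branch idea described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: all module names, the choice of a temporal convolution for the local branch, attention over a learnable class-wise memory for the global branch, and an MSE distillation loss are assumptions made for the example.

```python
import torch
import torch.nn as nn

class LocalEnhance(nn.Module):
    """Intra-video branch (hypothetical): a temporal 1-D convolution
    that lets snippet features interact along the time axis."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (B, T, D)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class GlobalEnhance(nn.Module):
    """Inter-video branch (hypothetical): snippet features attend to a
    learnable memory of per-category representations shared across videos."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, x):                      # x: (B, T, D)
        attn = torch.softmax(x @ self.memory.t(), dim=-1)   # (B, T, C)
        return x + attn @ self.memory                       # (B, T, D)

B, T, D, C = 2, 16, 64, 20                     # toy sizes, not from the paper
feats = torch.randn(B, T, D)                   # pre-extracted snippet features
local_branch = LocalEnhance(D)
global_branch = GlobalEnhance(D, C)

f_local = local_branch(feats)
f_global = global_branch(feats)

# Distillation: push the local branch toward the (detached) global branch,
# so only the local branch is needed at inference time.
distill_loss = nn.functional.mse_loss(f_local, f_global.detach())
```

During training, `distill_loss` would be added to the usual classification objective; at inference, only `local_branch` is run, which is how the extra cost of the memory-based global module is avoided.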
