Abstract

Temporal action localization, which aims to locate the temporal regions where actions take place in untrimmed real-world videos and recognize their corresponding classes, is a challenging task. As a critical cue for video understanding, exploiting video context has become an important strategy for boosting localization performance. However, previous methods mainly focus on semantic context, which captures feature similarity among frames or proposals. Temporal position context, which is also vital for temporal action localization, has been less explored. In this paper, we propose a position-sensitive context modeling approach that fuses both semantic and position context for more precise action localization. Specifically, we first propose a position encoding method tailored for temporal action localization at both the frame level and the proposal level, ensuring that the generated position representations model the distance and chronological relationships among frames or proposals. We then conduct attention-based context aggregation to produce discriminative features that aid precise boundary detection and proposal evaluation. Our method achieves state-of-the-art performance on two widely used datasets, THUMOS-14 and ActivityNet-1.3, demonstrating its effectiveness and generalizability.
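To make the core idea concrete, the sketch below shows one plausible way to combine frame-level position encoding with attention-based aggregation. It is a minimal illustration, not the paper's actual method: it assumes a standard transformer-style sinusoidal encoding of frame indices (which does preserve distance and chronological order) and off-the-shelf multi-head attention; the names `temporal_position_encoding` and `PositionSensitiveAggregation` are hypothetical.

```python
import torch
import torch.nn as nn


def temporal_position_encoding(num_frames: int, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of frame indices (dim assumed even): nearby frames
    get similar codes, and the fixed frequencies preserve chronological order
    and relative temporal distance."""
    positions = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)
    freqs = torch.exp(
        torch.arange(0, dim, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / dim)
    )
    enc = torch.zeros(num_frames, dim)
    enc[:, 0::2] = torch.sin(positions * freqs)
    enc[:, 1::2] = torch.cos(positions * freqs)
    return enc


class PositionSensitiveAggregation(nn.Module):
    """Attention over frame features whose queries/keys carry position codes,
    so aggregation weights reflect both semantic similarity (feature content)
    and temporal position context."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim)
        pos = temporal_position_encoding(
            frame_feats.size(1), frame_feats.size(2)
        ).to(frame_feats.device)
        x = frame_feats + pos  # inject position context into each frame feature
        out, _ = self.attn(x, x, x)  # position-sensitive context aggregation
        return out


# Usage: aggregate context over 100 frames of 256-d features.
feats = torch.randn(2, 100, 256)
ctx = PositionSensitiveAggregation(256)(feats)
print(ctx.shape)  # torch.Size([2, 100, 256])
```

The same pattern would extend to the proposal level by encoding proposal center positions instead of frame indices before attention.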
