Abstract

Existing online video processing methods, such as online action detection, focus on frame-level understanding for high responsiveness. However, these methods have a fundamental limitation: they lack instance-level understanding of videos, making them difficult to apply to higher-level vision tasks. Conversely, instance-level action detection, known as Temporal Action Localization (TAL), has limitations when applied to online settings. In this work, we introduce a new task, named Online Temporal Action Localization (OnTAL), that aims to detect action instances in videos in an online setting. To tackle this problem, we propose a 2-Pass End/Start detection Network (2PESNet) that detects action instances by effectively finding the start and end of each action instance. Additionally, we propose a two-stage action end detection method to further improve performance. Extensive experiments on THUMOS'14 and ActivityNet v1.3 demonstrate that our model achieves both accuracy and responsiveness when predicting action instances from streaming videos.

