Abstract

Existing online video processing methods, such as online action detection, focus on frame-level understanding to achieve high responsiveness. However, these methods share a fundamental limitation: they lack an instance-level understanding of videos, which makes them difficult to apply to higher-level vision tasks. Instance-level action detection, known as Temporal Action Localization (TAL), in turn has limitations when applied to online settings. In this work, we introduce a new task, named Online Temporal Action Localization (OnTAL), that aims to detect action instances in videos in an online setting. To tackle this problem, we propose a 2-Pass End/Start detection Network (2PESNet) that detects action instances by effectively finding the start and end of each instance. Additionally, we propose a two-stage action end detection method to further improve performance. Extensive experiments on THUMOS’14 and ActivityNet v1.3 demonstrate that our model achieves both accuracy and responsiveness when predicting action instances from streaming videos.
