Abstract

Temporal action localization in untrimmed videos is a fundamental task for real-world computer vision applications such as video surveillance. Although the problem has received a great deal of research attention, precise frame-level localization of human activities still remains a challenge. In this paper, we propose CoarseFine networks, which learn highly discriminative features without loss of temporal granularity through two streams: the coarse network and the fine network. The coarse network classifies the action category from the global context of a video, exploiting the descriptive power of successful action recognition models. The fine network, in contrast, applies no temporal pooling and is constrained to a low channel capacity; it is specialized to identify the per-frame location of actions based on local semantics. This design lets CoarseFine networks learn fine-grained representations without any temporal information loss. Extensive experiments on two challenging benchmarks, THUMOS14 and ActivityNet-v1.3, validate that the proposed method outperforms the state of the art by a remarkable margin on per-frame labeling and temporal action localization while significantly reducing computational cost.
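The coarse/fine split described above can be illustrated with a toy NumPy sketch: a coarse stream that pools over time before classifying the clip, a fine stream that keeps every frame but uses few channels, and a simple additive fusion of the two. All array sizes, weight matrices, and the fusion rule here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

T, C = 64, 32          # frames and input feature channels (hypothetical sizes)
num_classes = 5

# Per-frame features, standing in for a backbone's output.
features = rng.standard_normal((T, C))

# Coarse stream: temporal average pooling discards time granularity,
# then a wide classifier predicts a single clip-level action score.
W_coarse = rng.standard_normal((C, num_classes)) * 0.1
coarse_logits = features.mean(axis=0) @ W_coarse          # shape (num_classes,)

# Fine stream: no temporal pooling; a low-capacity (few-channel)
# projection keeps one score vector per frame.
C_fine = 8
W_proj = rng.standard_normal((C, C_fine)) * 0.1
W_fine = rng.standard_normal((C_fine, num_classes)) * 0.1
fine_logits = (features @ W_proj) @ W_fine                # shape (T, num_classes)

# Fusion (assumed additive): broadcast the clip-level context over every frame,
# yielding a per-frame class prediction at full temporal resolution.
per_frame_logits = fine_logits + coarse_logits            # shape (T, num_classes)
per_frame_labels = per_frame_logits.argmax(axis=1)        # shape (T,)
print(per_frame_logits.shape, per_frame_labels.shape)
```

The point of the sketch is the asymmetry: the coarse path trades temporal resolution for channel capacity, the fine path does the opposite, and fusing them gives frame-level labels informed by global context.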

Highlights


  • We propose CoarseFine networks that learn fine-grained temporal representations by combining the benefits of temporal preservation and temporal downsampling with two streams: the coarse network focusing on action classes and the fine network concentrating on temporal locations

  • Extensive experiments demonstrate that our CoarseFine networks outperform state-of-the-art methods on the per-frame labeling task


Summary

Introduction

The number of videos in our lives has been increasing exponentially with the advancement of digital imaging technology. Video recording devices, ranging from non-stop equipment such as CCTVs and dashcams to hand-held devices like mobile phones, constantly produce vast amounts of visual data. Automatic video analysis, especially for human activity understanding, has become an indispensable core technology for various real-world applications. The performance of human activity recognition on temporally trimmed videos containing a single action instance has reached a significant level through the research efforts of the past decades. Videos in the wild, however, are untrimmed and can contain multiple action instances amid a variety of background content. Research attention in action recognition is therefore shifting toward localizing actions in such untrimmed videos.

