Abstract

Recognition of general actions has seen great breakthroughs in recent years. However, real-world applications often require finer-grained action classification. The major challenge is that fine-grained actions usually share high similarities in both appearance and motion pattern, making them difficult to distinguish with existing general action representations. To address this problem, we introduce a visual attention mechanism into the proposed descriptor, termed the Actionness-pooled Deep-convolutional Descriptor (ADD). Instead of pooling features uniformly from the entire video, we aggregate features in sub-regions that are more likely to contain actions according to actionness maps, which endows ADD with the capability of capturing the subtle differences between fine-grained actions. We conduct experiments on the HIT Dances dataset, one of the few existing datasets for fine-grained action analysis. Quantitative results demonstrate that ADD remarkably outperforms the traditional two-stream representation. Extensive experiments on two general action benchmarks, JHMDB and UCF101, additionally show that combining ADD with an end-to-end ConvNet can further boost recognition performance.
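The abstract does not specify implementation details, but the core pooling idea can be illustrated with a minimal sketch: instead of uniform average pooling over a ConvNet feature map, spatial positions are weighted by actionness scores. The function name, array shapes, and normalization below are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def actionness_pooled_descriptor(feature_map, actionness_map, eps=1e-8):
    """Aggregate convolutional features weighted by an actionness map.

    feature_map:    (C, H, W) ConvNet activations for one frame.
    actionness_map: (H, W) non-negative actionness scores, assumed
                    already resized to the feature-map resolution.
    Returns a (C,) descriptor emphasizing action-relevant sub-regions.
    """
    # Normalize actionness scores into spatial pooling weights.
    weights = actionness_map / (actionness_map.sum() + eps)
    # Weighted sum over spatial positions instead of uniform pooling.
    return (feature_map * weights[None, :, :]).sum(axis=(1, 2))

# Example with hypothetical shapes: a 512-channel 14x14 feature map.
feat = np.random.rand(512, 14, 14).astype(np.float32)
act = np.random.rand(14, 14).astype(np.float32)
add = actionness_pooled_descriptor(feat, act)
print(add.shape)  # (512,)

Under uniform actionness this reduces to ordinary average pooling; the weighting only changes the result where the actionness map concentrates mass on action-relevant regions, which is what lets the descriptor separate visually similar fine-grained actions.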
