Abstract

Multimodal human action recognition with depth sensors has drawn wide attention due to its potential applications such as health-care monitoring, smart buildings/homes, intelligent transportation, and security surveillance. Sub-action sharing, especially among similar action categories, is one of the main obstacles to robust action recognition. This paper proposes a segmental architecture that exploits the relations of sub-actions, jointly with heterogeneous information fusion and Class-privacy Preserved Collaborative Representation (CPPCR), for multimodal human action recognition. Specifically, the segmental architecture is built on the normalized action motion energy. It models long-range temporal structure over video sequences to better distinguish similar actions that share sub-actions. Sub-action based depth motion and skeleton features are then extracted and fused. Moreover, by introducing within-class local consistency into Collaborative Representation (CR) coding, CPPCR is proposed to address the sub-action sharing phenomenon and learn a high-level discriminative representation. Experiments on four datasets demonstrate the effectiveness of the proposed method.
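As background for the CPPCR component, the sketch below shows standard Collaborative Representation (CR) coding with its ridge-regularized closed-form solution and residual-based classification. It is a minimal illustration only, not the paper's CPPCR: the within-class local-consistency term that distinguishes CPPCR is not reproduced here because its exact form is not given in this summary, and the names `cr_code`, `cr_classify`, `X`, `y`, and `lam` are illustrative placeholders.

```python
import numpy as np

def cr_code(X, y, lam=0.01):
    """Standard Collaborative Representation (CR) coding.

    X   : (d, n) dictionary whose columns are training feature vectors.
    y   : (d,)   query feature vector.
    lam : ridge regularization weight.

    Solves  min_a ||y - X a||^2 + lam * ||a||^2  in closed form:
            a = (X^T X + lam * I)^{-1} X^T y
    """
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

def cr_classify(X, labels, y, lam=0.01):
    """Assign y to the class whose atoms give the smallest reconstruction residual."""
    a = cr_code(X, y, lam)
    residuals = {}
    for cls in np.unique(labels):
        mask = labels == cls
        residuals[cls] = np.linalg.norm(y - X[:, mask] @ a[mask])
    return min(residuals, key=residuals.get)
```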

Highlights

  • According to the features used for action recognition, existing methods can be categorized into hand-crafted feature based and deep learning based approaches

  • The results demonstrate that the proposed CPPCR achieves 91.5% accuracy, which is higher than Multimodal Hybrid Centroid Canonical Correlation Analysis (CCA), Multimodal Centroid CCA, and MCCA by 1.5%, 3.5%, and 9%, respectively

  • There are many challenges in human action recognition based on RGB-Depth sensors, among which the ubiquitous sub-action sharing phenomenon is a critical one


Summary

RELATED WORK

According to the features used for action recognition, existing methods can be categorized into hand-crafted feature based and deep learning based approaches.

PROPOSED HETEROGENEOUS FEATURES AND FUSION

For depth sensor based human action datasets, skeleton data are not always accurate, as shown in the third row of Figure 1. To express dynamic information such as the speed variations of human motion over time, we propose to segment each video into temporal sub-actions according to a motion energy function, so that sub-actions of different lengths can be obtained; we call this ‘‘energy-oriented’’ segmentation. The motion energy of a frame reflects the frame's relative motion status and location with respect to the entire activity. Based on this method, a video is adaptively divided into sub-segments of unequal length, effectively capturing the temporal order of the motion.
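Since the exact motion energy function is not reproduced in this summary, the sketch below assumes a simple per-frame energy (total skeleton joint displacement between consecutive frames) and splits a sequence at equal quantiles of the normalized cumulative energy, which yields sub-actions of unequal temporal length as described above. The function and parameter names (`motion_energy`, `energy_oriented_segments`, `num_segments`) are illustrative, not taken from the paper.

```python
import numpy as np

def motion_energy(skeleton, eps=1e-8):
    """Normalized cumulative motion energy curve in [0, 1].

    skeleton : (T, J, 3) array of T frames with J 3-D joints.
    Per-frame energy is assumed here to be the total joint displacement
    between consecutive frames.
    """
    disp = np.linalg.norm(np.diff(skeleton, axis=0), axis=-1).sum(axis=-1)  # (T-1,)
    cum = np.cumsum(np.concatenate([[0.0], disp]))
    return cum / (cum[-1] + eps)

def energy_oriented_segments(skeleton, num_segments=4):
    """Split a sequence into sub-actions of unequal length so that each
    sub-action covers roughly an equal share of the total motion energy."""
    cum = motion_energy(skeleton)
    bounds = [0]
    for k in range(1, num_segments):
        bounds.append(int(np.searchsorted(cum, k / num_segments)))
    bounds.append(len(cum))
    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]

# Example: split a random 60-frame, 20-joint sequence into 4 sub-actions.
if __name__ == "__main__":
    print(energy_oriented_segments(np.random.rand(60, 20, 3), num_segments=4))
```

Fast sub-actions therefore span fewer frames and slow ones span more, which is what lets the segments reflect speed variations over time.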

ENERGY-ORIENTED DEPTH FEATURE EXTRACTION
PRELIMINARY
EXPERIMENTS AND PERFORMANCE EVALUATION
ABLATION EVALUATION
Findings
CONCLUSION
