Abstract

Human action recognition (HAR) under different camera viewpoints is a critical requirement for practical deployment. In this paper, we propose a novel method that combines successful deep learning-based features for action representation with multi-view analysis to achieve robust HAR under viewpoint changes. Specifically, we investigate various deep learning techniques, from 2D CNNs to 3D CNNs, to capture the spatial and temporal characteristics of actions in each separate camera view. A common feature space is then constructed to retain view-invariant features across the extracted streams. This is carried out by learning a set of linear transformations that project view-specific (private) features into a common space in which the classes are well separated from each other. To this end, we first adopt Multi-view Discriminant Analysis (MvDA). The original MvDA can fail to find the most class-discriminative common space because its objective concentrates on pushing classes away from the global mean while ignoring the distances between specific pairs of neighboring classes. We then introduce pc-MvDA, a pairwise-covariance maximizing extension that takes pairwise distances between classes into account. The new method is also formulated so that it can be applied more readily to large, high-dimensional multi-view datasets. Extensive experimental results on four datasets (IXMAS, MuHAVi, MICAGes, NTU RGB+D) show that pc-MvDA achieves consistent performance gains, especially for harder classes. The code is publicly available for research purposes at https://github.com/inspiros/pcmvda.
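The contrast between a global-mean criterion and a pairwise criterion can be illustrated with a small NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation or the code in the linked repository: the function names, the toy data, and the hand-set projection matrices `W_a` and `W_b` are all hypothetical, and nothing here is learned. It projects two views into a shared 2-D space, then compares the global between-class scatter with the squared distances between individual pairs of class means.

```python
import numpy as np

def project_views(views, transforms):
    """Project each view (n x d_v) with its own matrix (d_v x k) and stack the results."""
    return np.vstack([X @ W for X, W in zip(views, transforms)])

def class_means(Z, y):
    """Return the sorted class labels and the per-class mean in the common space."""
    classes = np.unique(y)
    return classes, np.stack([Z[y == c].mean(axis=0) for c in classes])

def global_between_scatter(Z, y):
    """Sum over classes of n_c * ||mu_c - mu||^2 (distance to the global mean)."""
    classes, means = class_means(Z, y)
    mu = Z.mean(axis=0)
    counts = np.array([(y == c).sum() for c in classes])
    return float((counts * ((means - mu) ** 2).sum(axis=1)).sum())

def pairwise_mean_distances(Z, y):
    """Squared distance between every pair of class means."""
    classes, means = class_means(Z, y)
    diff = means[:, None, :] - means[None, :, :]
    return classes, (diff ** 2).sum(axis=-1)

# Toy data (assumption): three classes, two of them close together, one far away.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [20.0, 0.0]])
labels = np.repeat(np.arange(3), 20)  # 20 samples per class in each view

def sample_latent():
    return centers[labels] + 0.1 * rng.standard_normal((60, 2))

view_a = sample_latent()                                            # 2-D features
view_b = np.hstack([sample_latent(), rng.standard_normal((60, 1))])  # 3-D, one nuisance dim

# Hand-set (not learned) linear maps into a shared 2-D space.
W_a = np.eye(2)
W_b = np.eye(3)[:, :2]

Z = project_views([view_a, view_b], [W_a, W_b])
y = np.concatenate([labels, labels])

sb = global_between_scatter(Z, y)
classes, D = pairwise_mean_distances(Z, y)

print("global between-class scatter:", sb)
print("pairwise squared distances between class means:")
print(D)
```

In this toy setup the global scatter is dominated by the distant class, while the pairwise matrix shows that classes 0 and 1 remain close to each other. That closeness is the kind of information a pairwise-aware criterion is meant to account for.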
