Abstract

The emergence of low-cost depth sensors opens up new possibilities for skeleton-based human action recognition. Recent methods for this task have made significant progress by incorporating graph convolutions. However, they (1) have limitations in modeling complex and variable temporal dynamics, and (2) cannot fully exploit the complementarity of spatial and temporal features. Moreover, (3) the loss function of these methods has an inherent weakness in optimizing intraclass compactness. To address these issues, we propose a pyramidal graph convolutional network (PY-GCN) in this paper. Specifically, (1) an effective yet efficient single-oriented pyramidal convolution is proposed, which involves multiple kernels of varying sizes and depths that capture different levels of temporal dynamics at multiple scales. (2) A pseudo-two-stream structure is proposed for the basic block of the network to comprehensively aggregate discriminative cross-spatiotemporal features. (3) A pairwise Gaussian loss is introduced alongside the cross-entropy loss, so that training attends to both intraclass compactness and interclass separability. Our PY-GCN achieves state-of-the-art performance on three challenging large-scale datasets.
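
To make the idea in (1) concrete, the following is a minimal PyTorch sketch of a temporal (single-oriented) pyramidal convolution over skeleton features. The tensor layout (N, C, T, V), the kernel sizes, and the group counts ("depths") are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a temporal pyramidal convolution, assuming skeleton
# features of shape (N, C, T, V) = (batch, channels, frames, joints).
# Kernel sizes and group counts below are illustrative, not the paper's.
import torch
import torch.nn as nn

class TemporalPyramidConv(nn.Module):
    """Parallel temporal convolutions with different kernel sizes and
    group counts, concatenated along the channel dimension."""
    def __init__(self, in_channels, out_channels,
                 kernel_sizes=(3, 5, 7, 9), groups=(1, 2, 4, 8)):
        super().__init__()
        assert out_channels % len(kernel_sizes) == 0
        branch_out = out_channels // len(kernel_sizes)
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_out,
                      kernel_size=(k, 1),   # convolve along time only
                      padding=(k // 2, 0),  # preserve temporal length
                      groups=g)             # deeper pyramid levels use more groups
            for k, g in zip(kernel_sizes, groups)
        ])

    def forward(self, x):  # x: (N, C, T, V)
        return torch.cat([branch(x) for branch in self.branches], dim=1)

# Usage: 64 -> 128 channels on 8 clips of 64 frames with 25 joints.
layer = TemporalPyramidConv(64, 128)
out = layer(torch.randn(8, 64, 64, 25))  # -> (8, 128, 64, 25)
```

Varying the kernel size changes the temporal receptive field of each branch, while varying the group count changes its connectivity depth, so the concatenated output mixes several scales of temporal context at similar cost to a single wide convolution.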
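For (3), the sketch below shows one plausible way a pairwise Gaussian loss could be combined with cross-entropy. The Gaussian similarity form exp(-||x_i - x_j||^2 / (2 sigma^2)), the hyper-parameters sigma and lam, and the pair-averaging scheme are assumptions made for illustration; the paper's formulation may differ.

```python
# A hedged sketch: a pairwise Gaussian loss that pulls same-class embeddings
# together (similarity -> 1) and pushes different-class pairs apart
# (similarity -> 0), added to the usual cross-entropy. sigma and lam are
# illustrative hyper-parameters, not values from the paper.
import torch
import torch.nn.functional as F

def pairwise_gaussian_loss(features, labels, sigma=1.0):
    # features: (N, D) embeddings, labels: (N,) class indices
    dist2 = torch.cdist(features, features).pow(2)      # squared pairwise distances
    sim = torch.exp(-dist2 / (2 * sigma ** 2))          # Gaussian similarity in (0, 1]
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    eye = torch.eye(len(labels), device=features.device)
    pos = ((1 - sim) * same * (1 - eye)).sum()          # intraclass compactness term
    neg = (sim * (1 - same)).sum()                      # interclass separability term
    n_pos = (same * (1 - eye)).sum().clamp(min=1)
    n_neg = (1 - same).sum().clamp(min=1)
    return pos / n_pos + neg / n_neg

def total_loss(logits, features, labels, lam=0.1):
    # Cross-entropy drives classification; the Gaussian term shapes the embedding.
    return F.cross_entropy(logits, labels) + lam * pairwise_gaussian_loss(features, labels)
```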
