Skeleton-based action recognition has attracted great interest in computer vision. A key challenge in this task is the large intraclass variance of skeleton data, which is mainly caused by diverse viewpoints and subjects and greatly increases the difficulty of modeling actions with a network. To address this problem, we propose a variance reduction (VaRe) framework for skeleton-based action recognition, which consists of a view-normalization generative adversarial network (VN-GAN), a subject-independent network (SINet) and a classification network. First, the VN-GAN is responsible for reducing view-induced intraclass variances. Specifically, this network, comprising a generator and a discriminator, learns a mapping from a diverse-view skeleton distribution to a unified-view skeleton distribution in an unsupervised manner, thereby generating a view-normalized skeleton. Second, taking the view-normalized skeleton as input, the SINet focuses on reducing the influence of subjects' personal habits on action recognition. To generate subject-independent (SI) skeleton data, the SINet automatically adjusts the human pose according to the human kinematic structure under a classification loss constraint. Finally, without the interference of view- and subject-induced variances, the classification network can concentrate on learning discriminative action features to predict classes. Furthermore, by combining the joint and bone modalities, the proposed framework achieves competitive performance on three benchmarks: NTU RGB+D, NTU-120 RGB+D and Northwestern-UCLA Multiview Action 3D.
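The abstract does not specify how the joint and bone modalities are computed or combined. As a minimal sketch, assuming the convention common in skeleton-based recognition (a bone is the vector from a parent joint to its child joint) and assuming late fusion by summing per-class scores, the idea might look as follows. The 5-joint skeleton, the `PARENTS` table and both function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Hypothetical 5-joint toy skeleton (NOT the NTU RGB+D joint layout).
# PARENTS[v] is the parent of joint v; the root (joint 0) is its own parent.
PARENTS: List[int] = [0, 0, 1, 1, 1]


def joints_to_bones(frame: Sequence[Vec3],
                    parents: Sequence[int] = PARENTS) -> List[Vec3]:
    """Derive the bone modality for one frame of 3D joint coordinates.

    Assumed convention: bone[v] = joint[v] - joint[parent(v)].
    The root bone is therefore the zero vector.
    """
    bones: List[Vec3] = []
    for v, child in enumerate(frame):
        parent = frame[parents[v]]
        bones.append((child[0] - parent[0],
                      child[1] - parent[1],
                      child[2] - parent[2]))
    return bones


def fuse_scores(joint_scores: Sequence[float],
                bone_scores: Sequence[float]) -> List[float]:
    """Assumed late fusion: add per-class scores from the two streams."""
    return [j + b for j, b in zip(joint_scores, bone_scores)]


# One toy frame: root, spine, head, left hand, right hand.
frame: List[Vec3] = [
    (0.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 2.0, 0.0),
    (-1.0, 1.0, 0.0),
    (1.0, 1.0, 0.0),
]
bones = joints_to_bones(frame)

# Scores from a joint-stream and a bone-stream classifier (made-up values).
fused = fuse_scores([0.2, 0.8], [0.6, 0.4])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

In this sketch each modality would feed its own classifier, and only the class scores are merged; the fusion actually used in the framework may differ.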