Segment spatial-temporal representation and cooperative learning of convolution neural networks for multimodal-based action recognition

Ziliang Ren, Qieshi Zhang, Jun Cheng, Fusheng Hao, Xiangyang Gao

doi:10.1016/j.neucom.2020.12.020

Abstract

Multimodal human action recognition is attracting increasing attention because different modalities can provide complementary information. However, recognition performance is difficult to improve because existing methods are limited in their ability to learn spatial-temporal features. In this paper, we propose a novel approach to multimodal human action recognition that learns complementary features from RGB-D sequences. First, a segmented rank pooling method is proposed to compress an entire RGB-D sequence into dynamic images, which serve as inputs to Convolutional Networks (ConvNets) for capturing spatial-temporal information. Second, Segment Cooperative ConvNets (SC-ConvNets) are designed to learn complementary features from the RGB and depth modalities. Unlike ConvNets-based approaches that learn multimodal features with multiple separate networks, the proposed SC-ConvNets enhance recognition performance by jointly optimizing a single ConvNet. A single loss function is then optimized to narrow the variance between the RGB and depth modalities. To verify the effectiveness of the proposed method, we evaluate the SC-ConvNets on four public multimodal benchmark datasets: NTU RGB+D 60, NTU RGB+D 120, SYSU 3D HOI, and PKU-MMD. The proposed method achieves state-of-the-art accuracies of 89.4% and 91.2% for cross-subject and cross-view evaluation on NTU RGB+D 60, 86.9% and 87.7% for cross-subject and cross-setup evaluation on NTU RGB+D 120, and 92.1% and 93.2% for cross-subject and cross-view evaluation on PKU-MMD. On SYSU 3D HOI, it achieves accuracies of 84.2% and 82.9% for setting-1 and setting-2, which are comparable to the state of the art. The experimental results demonstrate that the proposed segmented rank pooling can represent discriminative spatial-temporal information from entire RGB and depth sequences, and that the proposed SC-ConvNets can enhance recognition performance by learning complementary features from different modalities.
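The idea of compressing a sequence into per-segment dynamic images can be illustrated with a minimal sketch. The abstract does not give the pooling formula, so the code below assumes the commonly used approximate rank pooling weights, alpha_t = 2t - T - 1, and assumes equal-length temporal segments; the function names `approx_rank_pool` and `segmented_rank_pool` are hypothetical and are not taken from the paper.

```python
import numpy as np


def approx_rank_pool(frames: np.ndarray) -> np.ndarray:
    """Collapse frames of shape (T, H, W, C) into one dynamic image (H, W, C).

    Assumption: linear approximate rank pooling weights alpha_t = 2t - T - 1
    for t = 1..T, which may differ from the paper's exact formulation.
    """
    num_frames = frames.shape[0]
    alpha = 2.0 * np.arange(1, num_frames + 1) - num_frames - 1
    # Weighted sum over the time axis: later frames get larger weights.
    return np.tensordot(alpha, frames.astype(np.float64), axes=(0, 0))


def segmented_rank_pool(frames: np.ndarray, num_segments: int) -> np.ndarray:
    """Split a sequence into temporal segments and pool each one separately.

    Returns an array of shape (num_segments, H, W, C), one dynamic image per
    segment. Assumption: segments are contiguous and (nearly) equal in length.
    """
    if num_segments < 1 or frames.shape[0] < num_segments:
        raise ValueError("need at least one frame per segment")
    segments = np.array_split(frames, num_segments, axis=0)
    return np.stack([approx_rank_pool(seg) for seg in segments])


# Toy sequence: 6 frames of 2x2 single-channel "video".
toy = np.arange(6 * 2 * 2 * 1, dtype=np.float64).reshape(6, 2, 2, 1)
dynamic_images = segmented_rank_pool(toy, num_segments=2)
print(dynamic_images.shape)
```

In a setup like the one described, the same routine would be applied to the RGB stream and the depth stream separately, and the resulting dynamic images would then be fed to the network; that wiring is outside the scope of this sketch.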

Full Text