Abstract
Multi-modality human action recognition is a topic of growing interest. Multiple modalities provide richer and more complementary information than a single modality. However, it is difficult for multi-modality learning to effectively capture the spatial-temporal information in an entire RGB and depth sequence. In this paper, to obtain a better representation of spatial-temporal information, we propose a bidirectional rank pooling method to construct RGB Visual Dynamic Images (VDIs) and Depth Dynamic Images (DDIs). Furthermore, we design an effective segmentation convolutional networks (ConvNets) architecture based on a multi-modality hierarchical fusion strategy for human action recognition. The proposed method has been evaluated and achieves state-of-the-art results on the widely used NTU RGB+D, SYSU 3D HOI, and UWA3D II datasets.
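The idea of bidirectional rank pooling can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes the closed-form approximate rank pooling weights commonly used for dynamic images, alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), where H_t is the t-th harmonic number. It applies these weights once to the sequence in its original order and once in reversed order. The function names and the flattened-list frame representation are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # a flattened frame; a real pipeline would use image arrays


def rank_pool_coefficients(num_frames: int) -> List[float]:
    """Approximate rank pooling weights (assumed form, not taken from the paper).

    alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), with H_0 = 0.
    """
    # harmonic[k] holds H_k = 1 + 1/2 + ... + 1/k
    harmonic = [0.0]
    for i in range(1, num_frames + 1):
        harmonic.append(harmonic[-1] + 1.0 / i)
    total = harmonic[num_frames]
    return [
        2.0 * (num_frames - t + 1) - (num_frames + 1) * (total - harmonic[t - 1])
        for t in range(1, num_frames + 1)
    ]


def dynamic_image(frames: Sequence[Frame]) -> Frame:
    """Collapse a frame sequence into one frame as a weighted sum over time."""
    if not frames:
        raise ValueError("need at least one frame")
    weights = rank_pool_coefficients(len(frames))
    pooled = [0.0] * len(frames[0])
    for weight, frame in zip(weights, frames):
        for i, value in enumerate(frame):
            pooled[i] += weight * value
    return pooled


def bidirectional_dynamic_images(frames: Sequence[Frame]) -> Tuple[Frame, Frame]:
    """Pool the sequence in forward order and in reversed order."""
    forward = dynamic_image(frames)
    backward = dynamic_image(list(reversed(frames)))
    return forward, backward
```

Under this assumption the weights are not symmetric in time, so the forward and backward images are not simply negatives of each other. Applying the same function to an RGB sequence and to a depth sequence would give rough analogues of the VDIs and DDIs described above.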