Abstract
Following the success of the Vision Transformer (ViT) in image classification, many pure Transformer architectures for human action recognition have been proposed. However, very few works have used Transformers for bimodal action recognition, i.e., action recognition from both the skeleton and RGB modalities. As many previous works have shown, the RGB and skeleton modalities are complementary in human action recognition tasks. How to use both modalities in a Transformer-based framework remains a challenge. In this paper, we propose RGBSformer, a novel two-stream, pure Transformer-based framework for human action recognition using both RGB and skeleton modalities. Using only RGB videos, we can acquire skeleton data and generate the corresponding skeleton heatmaps. We then input the skeleton heatmaps and RGB frames to the Transformer at different temporal and spatial resolutions. Because the skeleton heatmaps are primary features compared to the original RGB frames, we use fewer attention layers in the skeleton stream. We also propose two ways to fuse the information from the two streams. Experiments demonstrate that the proposed framework achieves state-of-the-art results on four benchmarks: three widely used datasets (Kinetics400, NTU RGB+D 60, and NTU RGB+D 120) and the fine-grained dataset FineGym99.
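As an illustration of the heatmap-generation step, the sketch below renders one heatmap per joint from 2D keypoints. It is not the authors' implementation: the abstract does not say how the heatmaps are produced, so the Gaussian rendering, the function name `keypoints_to_heatmaps`, and the parameters `sigma`, `height`, and `width` are all assumptions made for this example. It uses only the Python standard library to stay self-contained; a practical version would likely be vectorized.

```python
import math


def keypoints_to_heatmaps(keypoints, scores, height, width, sigma=2.0):
    """Render one Gaussian heatmap per joint (hypothetical helper).

    keypoints: list of (x, y) pixel coordinates, one per joint.
    scores:    list of per-joint confidences, assumed to lie in [0, 1].
    Returns a list of J heatmaps, each a height x width nested list.
    """
    heatmaps = []
    for (cx, cy), score in zip(keypoints, scores):
        # Each pixel gets a Gaussian response centred on the joint,
        # scaled by the joint's confidence (an assumed design choice).
        heatmap = [
            [
                score * math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        heatmaps.append(heatmap)
    return heatmaps


# Two joints on a small 4 x 5 grid.
example = keypoints_to_heatmaps([(3, 2), (0, 0)], [1.0, 0.5], height=4, width=5)
```

Stacking such per-joint maps over the frames of a clip would give a heatmap volume that a skeleton stream could consume in place of raw joint coordinates.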