Abstract

Skeleton-based human action recognition is becoming popular due to its computational efficiency and robustness. Because not all skeleton joints are informative for a given action, attention mechanisms are adopted to emphasize informative joints and suppress the influence of irrelevant ones. However, existing attention frameworks usually ignore helpful scenario context information. In this paper, we propose a cross-attention module for skeleton-based action recognition that consists of a self-attention branch and a cross-attention branch. It helps to extract joints that are not only more informative but also highly correlated with the corresponding scenario context information. Moreover, the cross-attention module preserves the size of its input and can be flexibly incorporated into many existing frameworks without changing their behavior. To facilitate end-to-end training, we further develop a scenario context information extraction branch that extracts context information directly from raw RGB video. We conduct comprehensive experiments on the NTU RGB+D and Kinetics datasets, and the experimental results demonstrate the correctness and effectiveness of the proposed model.
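
The abstract does not specify the module's internals, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes dot-product scoring, a mean-pooled joint feature as the self-attention query, a single context vector of the same dimension as the joint features, a plain average to fuse the two branches, and a residual reweighting so that the output keeps the input size. All names (`cross_attention_module`, `joints`, `context`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention_module(joints, context):
    """Reweight joint features; the output has the same shape as `joints`.

    joints:  list of J joint feature vectors, each of length D
    context: one scenario-context vector of length D (assumed input)
    Returns (reweighted_joints, fused_attention_weights).
    """
    num_joints = len(joints)
    dim = len(context)
    scale = math.sqrt(dim)

    # Self-attention branch (assumption): score each joint against
    # the mean joint feature of the same skeleton.
    mean_joint = [sum(j[d] for j in joints) / num_joints for d in range(dim)]
    self_weights = softmax([dot(j, mean_joint) / scale for j in joints])

    # Cross-attention branch (assumption): score each joint against
    # the scenario-context vector.
    cross_weights = softmax([dot(j, context) / scale for j in joints])

    # Fuse the two branches with a plain average (assumption).
    weights = [(s + c) / 2 for s, c in zip(self_weights, cross_weights)]

    # Residual reweighting: every joint is kept and the input size is preserved.
    reweighted = [[x * (1.0 + w) for x in j] for j, w in zip(joints, weights)]
    return reweighted, weights


if __name__ == "__main__":
    skeleton = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 joints, 2 features each
    scene = [2.0, 0.0]                               # toy context vector
    out, w = cross_attention_module(skeleton, scene)
    print(len(out), len(out[0]), w)
```

Because the output has the same shape as the input, a block like this could in principle be placed between two layers of an existing skeleton network without altering the layers around it, which is the property the abstract highlights.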

Highlights

  • Human action recognition is a fundamental and challenging research problem in computer vision [1]–[8]

  • To utilize the complementary scenario context information, we propose a cross-attention module for skeleton-based action recognition

  • We incorporate RGB video into the learning of the cross-attention module. Our implementation differs significantly from existing RGB video-based action recognition methods: (1) we aim to extract context information from RGB video with a lightweight and relatively shallow network to promote the cross-attention module


Introduction

Human action recognition is a fundamental and challenging research problem in computer vision [1]–[8]. Its performance has an important influence on many other tasks, such as video understanding and video surveillance. With the development of depth sensors like Kinect [14] and pose estimation techniques [15], [16], skeleton-based human action recognition has received increasing attention recently [6], [17]–[19]. Human actions can be represented by a sequence of skeleton joints. It is well studied that for a certain action, different joints may contain different information and

