Abstract

Skeleton-based human action recognition has become an active research area in recent years. The key to this task is to fully exploit both spatial and temporal features. Recently, GCN-based methods, which model human body skeletons as spatial-temporal graphs, have achieved remarkable performance. However, most GCN-based methods use a fixed adjacency matrix defined by the dataset, which captures only the structural information provided by joints directly connected through bones and ignores the dependencies between distant joints that are not connected. In addition, using the same fixed adjacency matrix in all layers prevents the network from extracting multi-level semantic features. In this paper, we propose a pseudo graph convolutional network with temporal and channel-wise attention (PGCN-TCA) to address these problems. The fixed normalized adjacency matrix is replaced with a learnable matrix. In this way, the matrix can learn the dependencies between physically connected joints as well as between joints that are not physically connected. At the same time, learnable matrices in different layers help the network capture multi-level features in the spatial domain. Moreover, since frames and input channels that contain outstanding characteristics play significant roles in distinguishing an action from others, we propose a mixed temporal and channel-wise attention. Our method achieves performance comparable to state-of-the-art methods on the NTU-RGB+D and HDM05 datasets.
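
The contrast between a fixed normalized adjacency matrix and a learnable one can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 4-joint chain skeleton, the symmetric normalization D^(-1/2)(A + I)D^(-1/2), the scalar feature per joint, and the toy squared-error loss are not taken from the paper, which trains the matrices end to end inside a full network. The sketch only shows that a fixed matrix has zero weight for joints not linked by a bone, while a freely updated matrix receives gradient on every entry.

```python
import math

def normalized_adjacency(num_joints, bones):
    """Symmetric normalization D^(-1/2) (A + I) D^(-1/2) (assumed form)."""
    a = [[0.0] * num_joints for _ in range(num_joints)]
    for i in range(num_joints):
        a[i][i] = 1.0  # self-loop
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0  # undirected bone
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_joints)]
            for i in range(num_joints)]

def aggregate(matrix, x):
    """Each joint's output is a weighted sum of all joint features."""
    return [sum(m * xu for m, xu in zip(row, x)) for row in matrix]

def squared_error(matrix, x, target):
    return sum((o - t) ** 2 for o, t in zip(aggregate(matrix, x), target))

def sgd_step(matrix, x, target, lr):
    """One gradient step treating every matrix entry as a free parameter.

    d(loss)/d(M[v][u]) = 2 * (out[v] - target[v]) * x[u], which is dense.
    """
    out = aggregate(matrix, x)
    n = len(x)
    return [[matrix[v][u] - lr * 2.0 * (out[v] - target[v]) * x[u]
             for u in range(n)] for v in range(n)]

# Hypothetical 4-joint chain: 0 - 1 - 2 - 3
BONES = [(0, 1), (1, 2), (2, 3)]
x = [1.0, 2.0, 3.0, 4.0]        # one scalar feature per joint (toy data)
target = [4.0, 3.0, 2.0, 1.0]   # toy regression target

fixed = normalized_adjacency(4, BONES)
learned = sgd_step(fixed, x, target, lr=0.01)

print("fixed   weight joint 0 <- joint 3:", fixed[0][3])
print("learned weight joint 0 <- joint 3:", learned[0][3])
```

Initializing the free matrix from the normalized adjacency is one possible choice; the abstract does not say how the learnable matrix is initialized.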

Highlights

  • Understanding human action is one of the most important tasks in computer vision, as it facilitates a wide range of applications such as human-computer interaction, robotics and game control

  • Some frames which contain outstanding characteristics play significant roles in distinguishing the action from others. Inspired by such an observation and SENet [30], we propose our temporal and channel-wise attention (TCA) module

  • Our method can be applied to other graph-based methods by replacing the original adjacency matrices with learnable adjacency matrices, and the proposed temporal and channel-wise attention module can be integrated into any convolutional neural network (CNN) architecture
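
As a rough illustration of what a temporal and channel-wise gate could look like, the sketch below applies a squeeze-and-gate step along both axes of a feature tensor indexed as x[channel][frame][joint]. This is an assumption-laden simplification, not the paper's TCA module: the function name `tca_gate` is hypothetical, the fully connected layers that SENet places between the squeeze and the sigmoid are omitted, and multiplying the two gates together is only one guess at how the attentions might be mixed.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def tca_gate(x):
    """Rescale x[c][t][v] by a per-channel gate and a per-frame gate.

    Squeeze: average over the other two axes.
    Gate: sigmoid of the squeezed value (no learned layers in this sketch).
    """
    num_c, num_t, num_v = len(x), len(x[0]), len(x[0][0])
    channel_gate = [
        sigmoid(mean(x[c][t][v] for t in range(num_t) for v in range(num_v)))
        for c in range(num_c)
    ]
    frame_gate = [
        sigmoid(mean(x[c][t][v] for c in range(num_c) for v in range(num_v)))
        for t in range(num_t)
    ]
    out = [[[x[c][t][v] * channel_gate[c] * frame_gate[t]
             for v in range(num_v)]
            for t in range(num_t)]
           for c in range(num_c)]
    return out, channel_gate, frame_gate

# Toy input: 2 channels, 2 frames, 1 joint; only channel 1 at frame 1 is active.
x = [[[0.0], [0.0]],
     [[0.0], [4.0]]]
out, channel_gate, frame_gate = tca_gate(x)
print("channel gates:", channel_gate)
print("frame gates:  ", frame_gate)
```

In this toy input the active frame and channel receive the larger gates, which is the behavior the highlight describes: frames and channels with outstanding characteristics are emphasized.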


Summary

Introduction

Understanding human action is one of the most important tasks in computer vision, as it facilitates a wide range of applications such as human-computer interaction, robotics and game control. Skeletons consisting of 3D joint positions provide a good representation for describing human actions. With the rapid development in recent years of low-cost devices for capturing 3D data, such as Microsoft Kinect [1], skeleton data have become much easier to obtain. Skeletons are themselves high-level features of the human body and are invariant to appearance, which reduces the difficulty of representing and understanding different action categories. Skeleton-based action recognition has recently attracted increasing attention.

A. SPATIAL-TEMPORAL GRAPH CONSTRUCTION

For the spatial dimension, the joints and their natural connections in one frame form the spatial graph. Corresponding joints in two adjacent frames are connected by temporal edges.
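
The graph construction described above can be sketched as follows. The 5-joint skeleton, the function name `build_spatial_temporal_edges`, and the node indexing scheme (frame index times joint count plus joint index) are hypothetical choices made for illustration; real datasets define their own joint sets and bone lists.

```python
from typing import List, Tuple

Edge = Tuple[int, int]

def build_spatial_temporal_edges(
    num_joints: int, bones: List[Edge], num_frames: int
) -> Tuple[List[Edge], List[Edge]]:
    """Return (spatial_edges, temporal_edges) over nodes indexed t * num_joints + v."""

    def node(t: int, v: int) -> int:
        return t * num_joints + v

    # Spatial edges: the bones, repeated inside every frame.
    spatial = [(node(t, i), node(t, j))
               for t in range(num_frames)
               for (i, j) in bones]

    # Temporal edges: each joint linked to itself in the next frame.
    temporal = [(node(t, v), node(t + 1, v))
                for t in range(num_frames - 1)
                for v in range(num_joints)]
    return spatial, temporal

# Hypothetical toy skeleton: joint 1 is a hub connected to joints 0, 2, 3 and 4.
TOY_BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]
spatial, temporal = build_spatial_temporal_edges(5, TOY_BONES, num_frames=3)

print(len(spatial), "spatial edges,", len(temporal), "temporal edges")
```

With 4 bones and 3 frames this gives 12 spatial edges, and with 5 joints across 2 frame transitions it gives 10 temporal edges.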
