Abstract

Skeleton-based human action recognition has attracted increasing attention, and many methods have been proposed to boost performance. However, these methods still face three main limitations: 1) They focus on single-person action recognition while neglecting the group activities of multiple people (more than five). In practice, multi-person group activity recognition from skeleton data is also a meaningful problem. 2) They are unable to mine high-level semantic information from skeleton data, such as interactions among multiple people and their positional relationships. 3) Existing datasets for multi-person group activity recognition all consist of RGB videos and cannot be directly applied to skeleton-based group activity analysis. To address these issues, we propose a novel Zoom Transformer that exploits both low-level single-person motion information and high-level multi-person interaction information in a uniform model structure with carefully designed Relation-aware Maps. Besides, we estimate multi-person skeletons from existing real-world video datasets, i.e., Kinetics and Volleyball-Activity, and release two new benchmarks to verify the effectiveness of our Zoom Transformer. Extensive experiments demonstrate that our model can effectively cope with skeleton-based multi-person group activities. Additionally, experiments on the large-scale NTU-RGB+D dataset validate that our model also achieves remarkable performance on single-person action recognition. The code and the skeleton data are publicly available at https://github.com/Kebii/Zoom-Transformer
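
The two-level idea the abstract describes can be sketched roughly as follows: a low-level encoder over each person's joints, then a high-level transformer over person embeddings whose attention is biased by a relation-aware map of positional relationships. This is a minimal illustrative sketch only, not the authors' implementation (see the repository above for that); the class name ZoomSketch, the layer sizes, and the distance-based attention bias are all assumptions.

```python
# Hypothetical sketch of a two-level "zoom" structure for skeleton-based
# group activity recognition. Names, dimensions, and the distance-based
# relation bias are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class ZoomSketch(nn.Module):
    def __init__(self, dim=64, heads=4, num_classes=8):
        super().__init__()
        self.heads = heads
        self.joint_embed = nn.Linear(3, dim)  # (x, y, confidence) per joint
        self.person_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.group_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, skeletons):
        # skeletons: (B, P, J, 3) -- batch, persons, joints, coordinates
        B, P, J, _ = skeletons.shape
        x = self.joint_embed(skeletons).reshape(B * P, J, -1)
        # Low level: encode each person's joints, pool to one vector per person.
        person_feats = self.person_encoder(x).mean(dim=1).reshape(B, P, -1)
        # Relation-aware map stand-in: pairwise distances between people's
        # mean 2D positions, used as an additive attention bias so that
        # nearby people attend to each other more strongly.
        centers = skeletons[..., :2].mean(dim=2)          # (B, P, 2)
        bias = -torch.cdist(centers, centers)             # (B, P, P)
        bias = bias.repeat_interleave(self.heads, dim=0)  # (B*heads, P, P)
        # High level: model interactions among people under that bias.
        group_feats = self.group_encoder(person_feats, mask=bias)
        return self.head(group_feats.mean(dim=1))

logits = ZoomSketch()(torch.randn(2, 6, 18, 3))  # 6 people, 18 joints each
print(logits.shape)                              # torch.Size([2, 8])
```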
