Abstract

Most existing video-based 3D human pose estimation methods focus on single-scale spatial and temporal feature extraction. However, many human motions involve only a subset of local joints, which suggests that the local pose of the human body deserves attention in 3D pose estimation. In this paper, we propose a novel multi-scale spatial-temporal transformer framework for 3D human pose estimation. Our framework consists of two separate modules: a multi-scale spatial transformer module and a multi-scale temporal transformer module. The first module enhances spatial dependencies through joint-level and part-level spatial transformers. The second module captures the temporal correlation of human pose through local part-level and global whole-level temporal transformers. A weight fusion module then combines these cues to predict an accurate 3D human pose for the center frame. Experimental results show that our method achieves excellent performance.
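The abstract does not specify how the weight fusion module combines the branch outputs. As a minimal sketch, assuming the fusion is a softmax-normalized weighted average of per-branch 3D pose estimates (the function names, the per-branch scalar weights, and the softmax normalization are illustrative assumptions, not details from the paper):

```python
import math

def softmax(weights):
    # Normalize raw branch weights into fusion coefficients that sum to 1.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_poses(branch_poses, weights):
    """Hypothetical weight fusion of per-branch 3D pose estimates.

    branch_poses: one pose per branch, each a list of (x, y, z) joint coords.
    weights: one raw scalar weight per branch (assumed learnable in training).
    Returns the fused pose as a list of (x, y, z) tuples.
    """
    alphas = softmax(weights)
    num_joints = len(branch_poses[0])
    fused = []
    for j in range(num_joints):
        x = sum(a * pose[j][0] for a, pose in zip(alphas, branch_poses))
        y = sum(a * pose[j][1] for a, pose in zip(alphas, branch_poses))
        z = sum(a * pose[j][2] for a, pose in zip(alphas, branch_poses))
        fused.append((x, y, z))
    return fused

# Toy example: two branches (e.g. spatial and temporal), one joint.
spatial_branch = [(0.0, 0.0, 0.0)]
temporal_branch = [(2.0, 2.0, 2.0)]
print(fuse_poses([spatial_branch, temporal_branch], [0.0, 0.0]))
# → [(1.0, 1.0, 1.0)]  (equal weights average the two estimates)
```

With equal raw weights the softmax yields 0.5 per branch, so the fused joint lies midway between the two branch estimates; unequal learned weights would let the network favor the more reliable branch per deployment.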
