Abstract

Most existing video-based 3D human pose estimation methods focus on single-scale spatial and temporal feature extraction. However, many human motions involve only a subset of local joints, which suggests that the local pose of the human body deserves attention in 3D pose estimation. In this paper, we propose a novel multi-scale spatial-temporal transformer framework for 3D human pose estimation. Our framework consists of two separate modules: a multi-scale spatial transformer module and a multi-scale temporal transformer module. The first module enhances spatial dependencies through joint-level and part-level spatial transformers. The second module captures the temporal correlation of human pose through local part-level and global whole-level temporal transformers. A weight fusion module then combines these cues to predict an accurate 3D human pose for the center frame. Experimental results show that our method achieves excellent performance.
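The abstract does not specify how the weight fusion module combines the branch outputs. As a minimal sketch, assuming the fusion is a softmax-normalized weighted average of per-branch 3D pose estimates (the function names, the per-branch scalar weights, and the softmax normalization are illustrative assumptions, not details from the paper):

```python
import math

def softmax(weights):
    # Normalize raw branch weights into fusion coefficients that sum to 1.
    m = max(weights)
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_poses(branch_poses, weights):
    """Hypothetical weight fusion of per-branch 3D pose estimates.

    branch_poses: one pose per branch, each a list of (x, y, z) joint coords.
    weights: one raw scalar weight per branch (assumed learnable in training).
    Returns the fused pose as a list of (x, y, z) tuples.
    """
    alphas = softmax(weights)
    num_joints = len(branch_poses[0])
    fused = []
    for j in range(num_joints):
        x = sum(a * pose[j][0] for a, pose in zip(alphas, branch_poses))
        y = sum(a * pose[j][1] for a, pose in zip(alphas, branch_poses))
        z = sum(a * pose[j][2] for a, pose in zip(alphas, branch_poses))
        fused.append((x, y, z))
    return fused

# Toy example: two branches (e.g. spatial and temporal), one joint.
spatial_branch = [(0.0, 0.0, 0.0)]
temporal_branch = [(2.0, 2.0, 2.0)]
print(fuse_poses([spatial_branch, temporal_branch], [0.0, 0.0]))
# → [(1.0, 1.0, 1.0)]  (equal weights average the two estimates)
```

With equal raw weights the softmax yields 0.5 per branch, so the fused joint lies midway between the two branch estimates; unequal learned weights would let the network favor the more reliable branch per deployment.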
