Frame-level Feature Tokenization Learning for Human Body Pose and Shape Estimation

Hu Cao, Suping Wu, Meining Jia

doi:10.1109/fg52635.2021.9667013

Abstract

In this paper, we propose a frame-level feature tokenization method for human body pose and shape estimation (FTHE). Although conventional 3D human pose and shape estimation methods have achieved success on single images, recovering accurate and smooth 3D human motion from a video remains challenging. Unlike existing methods, FTHE attends to the meaningful, detailed temporal features between tokens of different granularity of video objects, and reduces the dominance of the current static frame. To this end, we design an accurate and interpretable temporal encoding module for feature extraction and motion reconstruction. More specifically, our model captures the temporal and static features of tokens at different granularities, and simultaneously enhances their correlation and multi-granularity consistency. Extensive experimental results on large-scale publicly available datasets demonstrate that FTHE achieves compelling performance compared to the state of the art. Code is available at: https://github.com/chriful/FG_2020_FTHE.

Full Text