Abstract

In this paper, we propose a frame-level feature tokenization method for human body pose and shape estimation (FTHE). Although conventional 3D human pose and shape estimation methods have achieved success on single images, recovering accurate and smooth 3D human motion from video remains challenging. Unlike existing methods, FTHE attends to the meaningful, detailed temporal features between tokens of different granularity of video objects and reduces the dominance of the current static frame. To this end, we carefully design an accurate and interpretable temporal encoding module for feature extraction and motion reconstruction. More specifically, our model captures temporal and static features of tokens at different granularities while simultaneously enhancing their correlation and multi-granularity consistency. Extensive experiments on large-scale publicly available datasets demonstrate that FTHE achieves compelling performance compared with the state of the art. Code is available at: https://githuh.com/chriful/FG_2020_FTHE.

