Abstract

3D human pose estimation in videos aims to locate the human joints in 3D space given a temporal sequence of frames. Motion information and skeleton context are two key elements for pose estimation in videos. In this paper, we propose a SlowFastFormer (slow-fast transformer) network in which two branches with different input frame rates encode these two kinds of context. The slow branch learns skeleton context at a lower frame rate, while the fast branch captures motion information at a higher frame rate. Through these two branches, the different kinds of context are encoded separately and then fused at a later stage so that skeleton context and motion information are fully utilized. A blending module is further developed to promote message exchange between the branches; in this blending stage, the different kinds of context are exchanged and the feature representation is consequently enhanced. Lastly, a hierarchical supervision scheme is tailored in which predictions at different levels are inferred progressively. Our approach achieves competitive performance with lower computational complexity on several benchmarks, namely Human3.6M, MPI-INF-3DHP, and HumanEva-I.
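To make the two-branch idea concrete, the sketch below shows one plausible way such a slow-fast transformer could be wired in PyTorch: the fast branch consumes the full-rate 2D keypoint sequence to model motion, the slow branch consumes a temporally subsampled copy to model skeleton context, and the two are fused late before regressing the 3D pose. The class name, dimensions, strides, and the pooling-plus-concatenation fusion are assumptions for illustration only; the paper's actual SlowFastFormer layers, blending module, and hierarchical supervision are not reproduced here.

```python
# A minimal sketch of a two-branch (slow/fast) transformer for lifting 2D joint
# sequences to a 3D pose. All names, dimensions, strides, and the mean-pooling
# late fusion are illustrative assumptions, not the paper's actual design; the
# blending module and hierarchical supervision described in the abstract are omitted.
import torch
import torch.nn as nn


def make_branch(dim: int, depth: int) -> nn.TransformerEncoder:
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)


class TwoBranchPoseLifter(nn.Module):
    def __init__(self, num_joints=17, dim=256, depth=4, slow_stride=4):
        super().__init__()
        self.slow_stride = slow_stride                  # slow branch sees subsampled frames
        self.embed = nn.Linear(num_joints * 2, dim)     # per-frame 2D skeleton -> token
        self.slow_branch = make_branch(dim, depth)      # skeleton context, lower frame rate
        self.fast_branch = make_branch(dim, depth)      # motion information, full frame rate
        self.head = nn.Linear(2 * dim, num_joints * 3)  # late fusion -> 3D joints

    def forward(self, x2d: torch.Tensor) -> torch.Tensor:
        # x2d: (batch, frames, num_joints * 2) 2D keypoints over a clip
        tokens = self.embed(x2d)
        slow = self.slow_branch(tokens[:, :: self.slow_stride])  # lower input rate
        fast = self.fast_branch(tokens)                          # higher input rate
        # Late fusion: pool each branch over time, concatenate, regress 3D pose.
        fused = torch.cat([slow.mean(dim=1), fast.mean(dim=1)], dim=-1)
        return self.head(fused)                         # (batch, num_joints * 3)


# Usage: a batch of 8 clips, 243 frames, 17 joints -> 3D poses of shape (8, 51).
pose3d = TwoBranchPoseLifter()(torch.randn(8, 243, 17 * 2))
```

Note that this sketch only illustrates the separate encoding and late fusion described in the abstract; the paper's blending stage exchanges information between the branches rather than relying on a single concatenation.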
