HDPLifter: Hierarchical Dynamics Perception for 2D-to-3D Human Pose Lifting
Recent 2D-to-3D pose lifting networks have achieved remarkable success in monocular 3D human pose estimation by learning joint dependencies. We observe that extracted 2D pose sequences suffer from spatial pose-topology ambiguity and loss of temporal movement-pattern information. Existing methods overlook these intrinsic limitations, resulting in inferior inference of the corresponding 3D poses. To address them, we introduce the Hierarchical Dynamics Pose Lifter (HDPLifter), which captures spatial human joint connections and subtle temporal movement patterns while maintaining global modeling through hierarchical perception. Specifically, we propose a Structure-aware Spatial Transformer that uses adaptive topology learning to efficiently integrate spatial joint connections. Moreover, a novel Hierarchical Temporal Transformer captures both subtle local joint movements and global movement patterns with a scalable receptive field. In both modules, we employ a 2D depth-wise convolution as the feedforward network to further gather local joint correlations in the spatial and temporal domains simultaneously. HDPLifter surpasses the state-of-the-art approach MotionBERT on the Human3.6M and MPI-INF-3DHP datasets with P1 errors of 38.0 mm and 14.4 mm, respectively, while using only one-fifth of its parameters.
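To make the depth-wise convolution feedforward idea concrete, the following is a minimal NumPy sketch (not the authors' implementation) of a depth-wise 2D convolution over a feature map laid out as joints x frames, so each channel mixes information only within its own local spatial-temporal neighborhood; the function name, tensor layout, and kernel size are illustrative assumptions.

```python
import numpy as np

def depthwise_conv2d(x, kernels, pad=1):
    """Depth-wise 2D convolution: each channel of `x` is convolved with
    its own kernel, so no information is mixed across channels.

    x:       (C, H, W) feature map; here H could index joints, W frames
    kernels: (C, k, k) one k-by-k kernel per channel (k odd, pad = k // 2)
    Returns an array of the same shape as `x`.
    """
    C, H, W = x.shape
    k = kernels.shape[1]
    # Zero-pad the spatial-temporal grid so output size matches input size.
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C, H, W))
    for c in range(C):  # channels are processed independently (depth-wise)
        for i in range(H):
            for j in range(W):
                # Correlate the local (joint, frame) window with this
                # channel's kernel to gather local joint correlations.
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * kernels[c])
    return out
```

In a framework such as PyTorch, the same operation corresponds to a grouped convolution with `groups` equal to the channel count, which is what keeps the parameter cost low compared with a dense feedforward layer.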