Existing Transformers for 3D human pose and shape estimation models often struggle with computational complexity, particularly when handling high-resolution feature maps. These challenges limit their ability to efficiently utilize fine-grained features, leading to suboptimal performance in accurate body reconstruction. In this work, we propose TransSMPL, a novel Transformer framework built upon the SMPL model, specifically designed to address the challenges of computational complexity and inefficient utilization of high-resolution feature maps in 3D human pose and shape estimation. By replacing HRNet with MobileNetV3 for lightweight feature extraction, applying pruning and quantization techniques, and incorporating an early exit mechanism, TransSMPL significantly reduces both computational cost and memory usage. TransSMPL introduces two key innovations: (1) a multi-scale attention mechanism, reduced from four scales to two, allowing for more efficient global and local feature integration, and (2) a confidence-based early exit strategy, which enables the model to halt further computations when high-confidence predictions are achieved, further enhancing efficiency. Extensive pruning and dynamic quantization are also applied to reduce the model size while maintaining competitive performance. Quantitative and qualitative experiments on the Human3.6M dataset demonstrate the efficacy of TransSMPL. Our model achieves an MPJPE (Mean Per Joint Position Error) of 48.5 mm, reducing the model size by over 16% compared to existing methods while maintaining a similar level of accuracy.
Read full abstract