Token Mixers are well established as effective components for visual tasks; however, they carry high computational complexity and typically model spatial relationships from only a single perspective. In this study, we propose LMFormer, a hybrid CNN-Transformer model for human pose estimation. We first design a lightweight multi-feature perspective Token Mixer that uses a lightweight feature reconstruction strategy to aggregate spatial and channel feature information simultaneously, thereby enhancing the performance and generalization capability of the model. We then explore multi-scale information interaction by developing an iterative multi-feature weighting module and by incorporating a multi-scale information propagation mechanism into the skip connections. Finally, using a multi-scale deep supervision strategy, we validate the network on the COCO, MPII, and CrowdPose benchmark datasets. Extensive experiments demonstrate that LMFormer captures multi-scale features comprehensively at reduced computational complexity, resulting in significant performance improvements. Specifically, LMFormer-B achieves an AP of 65.8 on the COCO val set, surpassing MobileNetV2 and ShuffleNetV2 by 1.0 and 5.6 points, respectively. Moreover, its parameter count is only 19.8% of MobileNetV2's and 25% of ShuffleNetV2's, and its GFLOPs are 43.8% and 50% of theirs, respectively. We aim to provide new insights into lightweight and efficient feature extraction strategies, as well as efficient Token Mixer designs.