Abstract
Vision Transformers have recently been applied to human pose estimation and have achieved excellent performance through the second-order spatial interaction provided by self-attention. However, it remains unclear whether higher-order spatial interaction can further benefit pose estimation. In this paper, we propose a novel approach based on multi-order spatial interactions and confirm that combining interactions of different orders is beneficial for the human pose estimation task. We first build a Triple Interaction Module (TIM) from pure convolutions, which performs spatial information interaction three times. In contrast to the Transformer, TIM is built from only a few pure convolutions and extends the second-order interaction of the Transformer to third-order without extensive additional computation, which makes it easier to explore inter-related features between keypoints of the human body. In addition, we combine TIM with a traditional CNN and a Transformer to form the Multi-order Spatial Interaction Network (MSIN). We use MSIN to extract keypoint heatmaps and verify that its order-by-order structure improves the overall accuracy of human keypoint localization. Experimental results demonstrate that MSIN performs favorably against state-of-the-art CNN-based and Transformer-based counterparts on the COCO and MPII datasets, while being more lightweight.