Abstract

Vision Transformers have recently been applied to human pose estimation and have achieved excellent performance through the second-order spatial interaction of self-attention. However, it remains unclear whether higher-order spatial interactions can further benefit pose estimation. In this paper, we propose a novel approach based on multi-order spatial interactions and confirm that combining different orders is beneficial for human pose estimation. We first build a Triple Interaction Module (TIM) from pure convolutions so that spatial information interacts three times. In contrast to the Transformer, TIM is compatible with several pure convolutions and extends the Transformer's second-order interaction to third order without extensive additional computation, which makes it easier to explore inter-related features between keypoints of the human body. In addition, we combine TIM with a traditional CNN and a Transformer to form the Multi-order Spatial Interaction Network (MSIN). We use MSIN to extract keypoint heatmaps and show that the order-by-order structure enhances the overall performance of locating human keypoints. Experimental results demonstrate that MSIN performs favorably against state-of-the-art CNN-based and Transformer-based counterparts on the COCO and MPII datasets while being more lightweight.
