Abstract

To improve the accuracy of human pose estimation, we propose a novel method based on the deep high-resolution network (HRNet) equipped with double attention residual blocks. First, channel attention and spatial attention modules are added to the residual block used for feature extraction, so that the network pays more attention to the target area, emphasizing important information and suppressing unimportant information. Moreover, this paper proposes a novel module, the Parallel Residual Attention Block (PRAB), which places the $3\times 3$ group convolution of ResNeXt in parallel with the $3\times 3$ convolution layer in the Bottleneck of ResNet, and then adds channel attention and spatial attention modules to these two branches, respectively. In this way, the network can further improve the accuracy of human keypoint detection without significantly increasing the computational overhead. To demonstrate the effectiveness of our method, a series of comparative experiments is conducted on the MPII Human Pose dataset and the COCO2017 keypoint detection dataset. The experimental results show that the attention mechanism effectively improves the accuracy of human pose estimation, and that the proposed PRAB achieves the best result, 90.5% on MPII, outperforming existing methods.

Highlights

  • Human pose estimation is a fundamental research topic in computer vision, with broad applications in behavior recognition, human-computer interaction, autonomous driving, etc.

  • Human pose estimation in static images is the basis of video pose estimation and tracking [1]–[4], which makes it very useful for higher-level tasks such as action recognition [5]

  • Adding channel attention and spatial attention to the human pose estimation network helps the model assign different weights to the features extracted from different parts of the image and pay more attention to the information that is useful for the task


Summary

INTRODUCTION

Human pose estimation is a fundamental research topic in computer vision, with broad applications in behavior recognition, human-computer interaction, autonomous driving, etc. Most earlier methods obtain high-resolution output by down-sampling and then up-sampling, whereas the network structure proposed by HRNet [12] maintains a high-resolution representation by connecting multiple subnetworks in parallel. It enhances the high-resolution representation through multi-scale fusion, which provides higher accuracy for human pose estimation. Adding channel attention and spatial attention to the human pose estimation network helps the model assign different weights to the features extracted from different parts of the image, pay more attention to the information that is useful for the task, and improve the overall performance of the network. We multiply the weight coefficient $M_c(F)$ by the original feature $F$ to get the channel attention map $F_c$.
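The channel-weighting step, in which the weight coefficient $M_c(F)$ multiplies the original feature $F$ to give $F_c$, can be illustrated with a minimal sketch in plain Python. This is not the paper's implementation. The function names `channel_weights` and `apply_channel_attention` are hypothetical, and the way the weights are computed here (a sigmoid over summed average- and max-pooled channel descriptors, with no learned layers) is an assumption borrowed from common channel attention designs. Only the final multiplication is taken from the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_weights(feature):
    """Hypothetical stand-in for M_c(F): one weight in (0, 1) per channel.

    Assumption: the average- and max-pooled descriptors of each channel are
    summed and squashed with a sigmoid. A real module would normally pass
    the descriptors through learned layers first; that part is omitted.
    """
    weights = []
    for channel in feature:  # each channel is an H x W nested list
        values = [v for row in channel for v in row]
        avg_pool = sum(values) / len(values)
        max_pool = max(values)
        weights.append(sigmoid(avg_pool + max_pool))
    return weights


def apply_channel_attention(feature, weights):
    """F_c = M_c(F) * F: scale every value in channel c by weights[c]."""
    return [
        [[w * v for v in row] for row in channel]
        for channel, w in zip(feature, weights)
    ]


# Toy feature map F with C = 2 channels of size 2 x 2 (made-up values).
F = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[-1.0, -2.0], [-3.0, -4.0]],
]
Mc = channel_weights(F)
Fc = apply_channel_attention(F, Mc)
```

In this toy input the first channel has larger activations, so it receives a weight close to 1 while the second is strongly suppressed. The output keeps the shape of $F$, which is what allows such a module to be dropped into a residual block without changing the surrounding layers.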

SPATIAL ATTENTION MODULE
EXPERIMENTS
CONCLUSION