Abstract

Occlusion handling in crowded scenes remains a highly challenging problem for human pose estimation. To address it, we propose two novel feed-forward network structures, the Global Feed-Forward Network (GFFN) and the Dynamic Feed-Forward Network (DFFN), which are specifically designed for image-based tasks: they capture both local and global contextual information within intermediate features and update feature representations adaptively under occlusion. By exploiting the context-modeling ability of the proposed GFFN and DFFN, we present a novel backbone network, the High-Resolution Context Network (HRNeXt), which learns high-resolution representations with rich contextual information to better estimate the poses of occluded human bodies. Compared with state-of-the-art pose estimation networks, our HRNeXt combines the advantages of convolution operations and attention mechanisms, and is more efficient in terms of training data size, network parameters, and computational cost. Experimental results show that HRNeXt significantly outperforms state-of-the-art backbone networks on challenging pose estimation datasets with frequent crowding and occlusion. Code is available at: https://github.com/ZiyiZhang27/HRNeXt.
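
The abstract does not specify the internal design of the GFFN, so the following is only a minimal, hypothetical sketch of the general idea it describes: a feed-forward block over an image feature map whose per-position (local) transform is modulated by a globally pooled context vector, with a residual update. All names, shapes, and the modulation scheme here are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

def global_ffn(x, w1, w2):
    """Hypothetical sketch of a 'global' feed-forward block (NOT the paper's GFFN).

    x : (H, W, C) feature map
    w1: (C, D) first projection, w2: (D, C) second projection
    """
    h = np.maximum(x @ w1, 0.0)             # local per-position transform (ReLU MLP)
    g = h.mean(axis=(0, 1), keepdims=True)  # global context: average over all spatial positions
    h = h * (1.0 + np.tanh(g))              # modulate local features with the global context
    return x + h @ w2                       # residual update of the feature map

# Illustrative usage with random weights (shapes only, no trained parameters):
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4, 8))
w1 = rng.normal(size=(8, 16)) * 0.1
w2 = rng.normal(size=(16, 8)) * 0.1
y = global_ffn(x, w1, w2)   # same shape as x: (4, 4, 8)
```

The residual form means the block refines rather than replaces the input features, which is consistent with the abstract's description of "updating feature representations" within a backbone.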
