Can relearning local representation help small networks for human pose estimation?

Dingning Xu, Lijun Guo, Rong Zhang, Jiangbo Qian, Shangce Gao

doi:10.1016/j.neucom.2022.11.025

Abstract

Human pose estimation is a special detection task for small object localization. Because body poses vary and scenes are complex, it requires attention not only to global structure but also to fine local detail. However, because of its sliding-window learning mechanism, a convolutional neural network (CNN) can only see spatial information within a receptive field of a specific size at a given layer. As the network deepens and the receptive field grows, the network gradually focuses on global spatial information and loses its perception of local features. To give deep convolutional neural networks the ability to relearn local information for structure analysis in deeper layers, we propose a layer-channel mixed attention mechanism, named integrated attention, that can be flexibly embedded into a CNN. Multiple features from previous layers are aggregated to build an attention that simultaneously observes different ranges of spatial structure. Through integrated attention, the network can observe the interdependence between local structures across different receptive fields and learn more clues, which enhances its expressive power for feature learning. Extensive experiments show that the integrated attention mechanism benefits human pose estimation. In particular, integrated attention helps small networks achieve more accurate predictions and even outperform larger ones with less computation and fewer parameters. Compared with other attention and keypoint refinement modules, it yields larger and more stable improvements.

Full Text