Recently, with the growing popularity of Convolutional Neural Networks (CNNs), major progress has been made in human pose estimation. However, because they do not account for human body structure, current methods still struggle with occlusion, complex backgrounds, and large pose variations. Human parsing, a closely related task, can provide useful semantic information about body parts for pose estimation. In this paper, we propose a novel convolutional network, consisting of a pose encoder and a parsing encoder in parallel, that combines parsing information with body structure information to assist human pose estimation in unconstrained environments. The pose encoder extracts discriminative features rich in structure information by incorporating a multi-branch parallel module and a hierarchical connection architecture into the hourglass network. The parsing encoder, in turn, obtains parsing information that serves as an extra constraint to improve pose estimation. The generated pose features and parsing features are then integrated for heatmap prediction. In addition, we apply a joint classification loss to preserve the structural consistency between keypoints and body parts. By correctly localizing the joints in their corresponding body parts, the network can better optimize the spatial relationship between different keypoints, thus reducing estimation errors. Notably, the proposed modules can be introduced into most existing methods. Experimental results on the extended PASCAL-Person-Part and LSP datasets demonstrate the effectiveness of our model and show that utilizing parsing information greatly improves performance.
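The abstract does not define the joint classification loss, so the following is only a minimal sketch of the underlying idea of keypoint-to-part consistency, not the paper's formulation. It assumes each joint is assigned one expected body part and scores the parsing prediction at the joint's heatmap peak with a cross-entropy term. The function names, the `joint_to_part` mapping, and the toy 2x2 inputs are all hypothetical, and the hard argmax used here is for illustration only (it is not differentiable, so this is not a trainable loss as written).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_2d(heatmap):
    """Return (row, col) of the largest value in a 2D list."""
    best = (0, 0)
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > heatmap[best[0]][best[1]]:
                best = (r, c)
    return best

def joint_part_consistency(heatmaps, part_logits, joint_to_part):
    """Mean cross-entropy between the part predicted at each joint's
    peak location and the part that joint is assumed to belong to.

    heatmaps:      J heatmaps, each an H x W nested list
    part_logits:   H x W x P nested list of per-pixel part scores
    joint_to_part: J part indices (hypothetical mapping)
    """
    total = 0.0
    for j, heatmap in enumerate(heatmaps):
        r, c = argmax_2d(heatmap)
        probs = softmax(part_logits[r][c])
        total += -math.log(probs[joint_to_part[j]] + 1e-12)
    return total / len(heatmaps)

# Toy example: top row of the image is part 0, bottom row is part 1.
part_logits = [
    [[5.0, 0.0], [5.0, 0.0]],
    [[0.0, 5.0], [0.0, 5.0]],
]
# Joint 0 peaks at the top-left pixel, joint 1 at the bottom-right pixel.
heatmaps = [
    [[0.9, 0.1], [0.0, 0.0]],
    [[0.0, 0.0], [0.1, 0.8]],
]

consistent = joint_part_consistency(heatmaps, part_logits, [0, 1])
swapped = joint_part_consistency(heatmaps, part_logits, [1, 0])
print(consistent < swapped)
```

In this toy case the penalty is small when each joint peaks inside its assigned part and large when the assignment is swapped, which is the kind of structural signal the abstract attributes to the loss.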