Abstract

We present an approach to 2D person pose estimation based on a convolutional neural network that performs both human 2-channel mask prediction and human 2D pose estimation. Conceptually the idea is simple and is inspired by prior research on image segmentation. We argue that explicitly encoded mask data can serve as a critical feature for person pose estimation. We propose a convolutional neural network model that combines an image segmentation technique with the bottom-up approach to human pose estimation. We observe that training a two-stage network end to end benefits both tasks: person mask prediction and 2D person pose estimation. At the pose estimation stage, we predict heat-maps for the person keypoint locations, together with their mutual connection relations, from the mask information. These are then used to estimate the final pose while removing unwanted or occluded keypoints, which could otherwise propagate through the network and lead to redundant pose estimates. We train and test our system on the MS-COCO dataset, and the experimental results validate the superior efficiency of the proposed method.
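The idea of using mask information to suppress unwanted keypoints can be illustrated with a minimal sketch. This is not the network described above, only a hypothetical post-processing step written with NumPy: it assumes the keypoint heat-maps and a single-channel person mask are already available as arrays, and the function name `masked_keypoints`, the array shapes, and the `threshold` parameter are all assumptions made for illustration.

```python
import numpy as np

def masked_keypoints(heatmaps, person_mask, threshold=0.5):
    """Pick one peak per keypoint heat-map, ignoring responses outside the mask.

    heatmaps:    array of shape (K, H, W), one channel per keypoint type (assumed layout)
    person_mask: array of shape (H, W) with values in [0, 1] (assumed layout)
    Returns a list of K entries, each (x, y, score) or None if no confident peak remains.
    """
    # Gate every heat-map channel by the mask so background responses are zeroed.
    gated = heatmaps * person_mask[None, :, :]
    keypoints = []
    for channel in gated:
        # Location of the strongest remaining response in this channel.
        y, x = np.unravel_index(int(np.argmax(channel)), channel.shape)
        score = float(channel[y, x])
        # Drop the keypoint when the remaining response is too weak.
        keypoints.append((int(x), int(y), score) if score >= threshold else None)
    return keypoints

# Toy example: keypoint 0 peaks inside the mask, keypoint 1 peaks outside it.
heatmaps = np.zeros((2, 4, 4))
heatmaps[0, 1, 2] = 0.9
heatmaps[1, 3, 3] = 0.8
person_mask = np.zeros((4, 4))
person_mask[0:3, 0:3] = 1.0
result = masked_keypoints(heatmaps, person_mask)
```

In this toy case the second keypoint is discarded because its only strong response lies outside the mask, which is the kind of spurious detection the mask is meant to filter out.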

Highlights

  • Multi-person pose estimation is used broadly in different applications of computer vision

  • MS-COCO provides annotations for mask segmentation as well as for person keypoint annotations, which are the basic requirements of our method

  • The main idea of this work is that, in human pose estimation, the explicitly encoded mask data serves as a critical feature in the structure of generative methods


Summary

Introduction

Multi-person pose estimation is used broadly in different applications of computer vision. Its purpose is to locate the different parts of the human body in an image or a video and, after all human poses have been detected automatically, to form a skeleton structure of each body. Pose estimation is a challenging problem because several key factors need to be taken into account, such as the background, the variety of clothing, and different lighting conditions. Earlier techniques are based on hand-crafted features, e.g., Edgelet [1] and HOG (Histogram of Oriented Gradients) [2], [3], but they are inadequate for locating human body parts precisely. Deep-learning-based techniques offer pixel-to-pixel correspondence through convolutions in pose estimation, but many improvements are still needed for the development of a real-time pose estimation network.

