Abstract
Human Pose Estimation (HPE) is defined as the problem of localizing human joints (also known as keypoints: elbows, wrists, etc.) in images or videos. It can equally be defined as the search for a specific pose in the space of all articulated poses. HPE has recently received significant attention from the scientific community, mainly because pose estimation is considered a key step for many computer vision tasks. Although many approaches have reported promising results, this domain remains largely unsolved due to several challenges such as occlusions, small and barely visible joints, and variations in clothing and lighting. In the last few years, the power of deep neural networks has been demonstrated in a wide variety of computer vision problems, and especially in the HPE task. In this context, we present in this paper a Deep Full-Body-HPE (DFB-HPE) approach from RGB images only. Based on ConvNets, fifteen human joint positions are predicted and can be further exploited for a large range of applications such as gesture recognition, sports performance analysis, or human-robot interaction. To evaluate the proposed deep pose estimation model, we apply it to recognize the daily activities of a person in an unconstrained environment. To this end, the extracted features, represented by deep estimated poses, are fed to an SVM classifier. To validate the proposed architecture, our approach is tested on two publicly available benchmarks for pose estimation and activity recognition, namely the J-HMDB and CAD-60 datasets. The obtained results demonstrate the efficiency of the proposed method based on ConvNets and SVM and show how deep pose estimation can improve recognition accuracy. In comparison with state-of-the-art methods, we achieve the best HPE performance, as well as the best activity recognition precision, on the CAD-60 dataset.
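The abstract describes feeding deep estimated poses to an SVM as feature vectors. As a minimal sketch of that idea, the fifteen (x, y) joint positions can be normalized into a translation- and scale-invariant vector before classification. The paper does not specify its exact feature encoding; the normalisation below and the `root_idx` torso-joint choice are assumptions for illustration.

```python
import math

def pose_feature(joints, root_idx=2):
    """Turn estimated (x, y) joint positions into a translation- and
    scale-invariant feature vector suitable for an SVM classifier.
    NOTE: the centring on a root (torso) joint and max-norm scaling
    are hypothetical choices, not the paper's documented encoding."""
    rx, ry = joints[root_idx]
    # Centre every joint on the chosen root joint.
    centred = [(x - rx, y - ry) for x, y in joints]
    # Scale by the farthest joint so body size does not matter.
    scale = max(math.hypot(x, y) for x, y in centred) or 1.0
    # Flatten to a 2 * len(joints) dimensional vector.
    return [c / scale for x, y in centred for c in (x, y)]

# Toy skeleton with 15 joints -> a 30-dimensional feature vector.
joints = [(float(i), float(2 * i)) for i in range(15)]
feat = pose_feature(joints)
```

One such vector per frame (or a stack of them per clip) would then be passed to an off-the-shelf SVM, e.g. `sklearn.svm.SVC`, as the abstract's pipeline suggests.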
Highlights
The amount of available video data is explosively expanding due to the pervasiveness of digital recording devices
Based on the work of Charles et al. [48], a joint is considered correctly located if it lies within a set distance of d pixels from the marked joint center in the Ground Truth (GT)
For the CAD-60 dataset, the pose estimation results are presented in Figure 6 as accuracy graphs, plotting accuracy against the allowed distance from the GT after applying the four-fold cross-validation process
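The evaluation criterion above is straightforward to express in code: a predicted joint counts as a hit if its Euclidean distance to the GT joint center is at most d pixels, and sweeping d yields the accuracy curves reported in the figures. A minimal sketch (the helper name and toy coordinates are illustrative, not from the paper):

```python
import math

def joint_accuracy(pred, gt, d):
    """Fraction of predicted joints lying within d pixels of the
    ground-truth joint centres (the criterion of Charles et al. [48])."""
    hits = sum(
        1 for (px, py), (gx, gy) in zip(pred, gt)
        if math.hypot(px - gx, py - gy) <= d
    )
    return hits / len(gt)

# Toy example: 3 joints; one prediction is 10 px off, one ~2.2 px off.
gt   = [(100.0, 50.0), (120.0, 80.0), (140.0, 110.0)]
pred = [(102.0, 51.0), (120.0, 80.0), (150.0, 110.0)]

# Accuracy as a function of the allowed distance d, as in the graphs.
curve = {d: joint_accuracy(pred, gt, d) for d in (1, 5, 10)}
```

Plotting `curve` over a dense range of d values reproduces the kind of accuracy-versus-distance graph described for Figure 6.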
Summary
The amount of available video data is explosively expanding due to the pervasiveness of digital recording devices. Previous works on HPE have commonly used graphical models, composed of joints and rigid parts, for estimating human poses. In [7], the authors presented a graphical model for HPE with image-dependent pairwise relations. They used local image measurements to detect joints and to predict the spatial relationships between them, with the aim of learning conditional probabilities for the presence of parts and their spatial relationships. Later, another approach was proposed using puppets [8]: it estimates the body pose in one frame and checks its consistency in neighboring frames using the optical flow
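The optical-flow check described for the puppet-based approach [8] amounts to transporting an estimated pose from one frame to its neighbor along the dense flow field and comparing the result with the estimate there. A minimal sketch of that propagation step, assuming a precomputed dense flow stored as `flow[y][x] = (dx, dy)` (the representation is an assumption; in practice such a field would come from e.g. OpenCV's `calcOpticalFlowFarneback`):

```python
def propagate_joints(joints, flow):
    """Shift each (x, y) joint from frame t to frame t+1 by sampling
    the dense optical-flow field at its rounded pixel location.
    `flow[y][x]` holds the (dx, dy) displacement of that pixel."""
    moved = []
    for x, y in joints:
        dx, dy = flow[int(round(y))][int(round(x))]
        moved.append((x + dx, y + dy))
    return moved

# Toy 3x3 flow field: uniform motion of (+1, 0) pixels per frame.
flow = [[(1.0, 0.0) for _ in range(3)] for _ in range(3)]
joints = [(0.0, 0.0), (1.0, 2.0)]
moved = propagate_joints(joints, flow)  # each joint shifted right by 1 px
```

Comparing `moved` with the pose estimated independently in frame t+1 gives the consistency check the summary alludes to.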