Abstract

Mainstream methods treat head pose estimation as a supervised classification/regression problem, whose performance heavily depends on the accuracy of the ground-truth labels of the training data. However, it is rather difficult to obtain accurate head pose labels in practice, due to the lack of effective equipment and reasonable approaches for head pose labeling. In this paper, we propose a method for head pose estimation which does not need to be trained with head pose labels, but instead matches keypoints between a reconstructed 3D face model and the 2D input image. The proposed method consists of two components: 3D face reconstruction and 3D–2D keypoint matching. In the 3D face reconstruction phase, a personalized 3D face model is reconstructed from the input head image using convolutional neural networks, which are jointly optimized by an asymmetric Euclidean loss and a keypoint loss. In the 3D–2D keypoint matching phase, an iterative optimization algorithm is proposed to efficiently match the keypoints between the reconstructed 3D face model and the 2D input image under the constraint of perspective transformation. The proposed method is extensively evaluated on five widely used head pose estimation datasets: Pointing'04, BIWI, AFLW2000, Multi-PIE, and Pandora. The experimental results demonstrate that the proposed method achieves excellent cross-dataset performance in terms of average MAE on Pointing'04, BIWI, AFLW2000, Multi-PIE, and Pandora, and surpasses most existing state-of-the-art approaches, even though the model is not trained on any of these five datasets.
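The 3D–2D keypoint matching phase described above can be illustrated with a minimal pure-Python sketch: given 3D face keypoints and their observed 2D image positions, iteratively adjust the Euler angles (yaw, pitch, roll) to minimize reprojection error. This sketch assumes a weak-perspective camera with known scale and translation, and uses a simple coordinate-descent solver; the function names and the solver are illustrative assumptions, not the paper's actual optimization algorithm.

```python
import math

def rot(yaw, pitch, roll):
    """ZYX Euler rotation matrix (angles in radians)."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    Ry = [[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]]        # yaw about y
    Rx = [[1, 0, 0], [0, cp, -sp], [0, sp, cp]]        # pitch about x
    Rz = [[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]]        # roll about z
    def mm(A, B):
        return [[sum(A[i][k] * B[k][j] for k in range(3))
                 for j in range(3)] for i in range(3)]
    return mm(Rz, mm(Ry, Rx))

def project(points3d, angles, scale=1.0, tx=0.0, ty=0.0):
    """Weak-perspective projection: rotate, drop z, scale, translate."""
    R = rot(*angles)
    out = []
    for X in points3d:
        x = sum(R[0][j] * X[j] for j in range(3))
        y = sum(R[1][j] * X[j] for j in range(3))
        out.append((scale * x + tx, scale * y + ty))
    return out

def reproj_error(angles, pts3d, pts2d):
    """Sum of squared distances between projected and observed keypoints."""
    return sum((u - pu) ** 2 + (v - pv) ** 2
               for (u, v), (pu, pv) in zip(project(pts3d, angles), pts2d))

def fit_pose(pts3d, pts2d, iters=60, step=0.1):
    """Coordinate descent over the three Euler angles (scale/translation
    assumed known here; the paper jointly handles the full transformation)."""
    angles = [0.0, 0.0, 0.0]
    s = step
    for _ in range(iters):
        improved = False
        for i in range(3):
            for d in (s, -s):
                cand = list(angles)
                cand[i] += d
                if reproj_error(cand, pts3d, pts2d) < reproj_error(angles, pts3d, pts2d):
                    angles = cand
                    improved = True
        if not improved:
            s *= 0.5  # shrink the search step once no direction improves
    return angles

if __name__ == "__main__":
    # Synthetic 3D face keypoints (nose, eye corners, mouth corners, chin).
    face = [(0.0, 0.0, 1.0), (-0.5, 0.3, 0.3), (0.5, 0.3, 0.3),
            (-0.3, -0.4, 0.4), (0.3, -0.4, 0.4), (0.0, -0.7, 0.2)]
    true_pose = [0.3, -0.2, 0.1]          # yaw, pitch, roll in radians
    observed = project(face, true_pose)   # synthetic "2D image" keypoints
    print(fit_pose(face, observed))       # ≈ [0.3, -0.2, 0.1]
```

In the actual method, the 3D keypoints come from the reconstructed personalized face model rather than a fixed template, and the transformation is a full perspective projection rather than the weak-perspective simplification used here.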

Highlights

  • Head pose plays a significant role in diverse applications such as human–computer interaction [1], driver monitoring [2], and analysis of students’ learning state [3], since it usually indicates the gaze direction and even the attention of a person

  • To avoid suffering from inaccurate labels in training datasets, a head-pose estimation method that employs keypoint-matching between the input image and the corresponding reconstructed 3D face model is proposed in this paper

  • In the 3D face reconstruction phase, a personalized 3D face model is reconstructed from the input head image using convolutional neural networks, which are jointly optimized by an asymmetric Euclidean loss and a keypoint loss


Summary

Introduction

Head pose plays a significant role in diverse applications such as human–computer interaction [1], driver monitoring [2], and analysis of students' learning state [3], since it usually indicates the gaze direction and even the attention of a person. From the machine learning perspective, the task of head-pose estimation consists of learning a model that maps an input head image to head-pose angles based on image and ground-truth label pairs. It is well known that the maximum achievable accuracy of a supervised model depends on the quality of the training data [17]. Like other supervised machine learning tasks, the performance of supervised-learning-based head-pose estimation methods heavily depends on the accuracy of the ground-truth labels in the training dataset [14,15]. Methods trained and tested on the same dataset with inaccurate labels tend to create the illusion that their models can accurately estimate head pose. To avoid suffering from inaccurate labels in training datasets, this paper proposes a head-pose estimation method that employs keypoint matching between the input image and the corresponding reconstructed 3D face model.

Supervised-Learning-Based Methods
Model-Based Methods
Overview
Model Representation
Network Structure
Loss Functions
Weak Perspective Transformation
Implementation Details
Datasets and Performance Metric
Performance Analysis of the Proposed Method
Comparisons with Other Methods
Cross-Dataset Experiments
Conclusions