Abstract

Improving performance on the AFEW (Acted Facial Expressions in the Wild) dataset, a sub-challenge of EmotiW (the Emotion Recognition in the Wild challenge), is a popular benchmark for emotion recognition under various real-world constraints, including uneven illumination, head deflection, and facial posture. In this paper, we propose a facial expression recognition cascade network comprising spatial feature extraction, hybrid attention, and temporal feature extraction. First, faces are detected in each frame of a video sequence, and the corresponding face ROI (region of interest) is extracted to obtain the face images. The face images in each frame are then aligned using the positions of the facial landmarks. Second, the aligned face images are fed into a residual neural network to extract the spatial features of the facial expressions. These spatial features are passed to the hybrid attention module to obtain fused facial expression features. Finally, the fused features are fed into a gated recurrent unit (GRU) to extract the temporal features of the facial expressions, which are passed to a fully connected layer to classify and recognize the expressions. Experiments on the CK+ (Extended Cohn-Kanade), Oulu-CASIA, and AFEW datasets yielded recognition accuracies of 98.46%, 87.31%, and 53.44%, respectively. The proposed method thus not only achieves performance competitive with state-of-the-art methods but also improves accuracy on the AFEW dataset by more than 2%, demonstrating its effectiveness for facial expression recognition in natural environments.
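
To make the pipeline concrete, the following is a minimal PyTorch sketch of the cascade described above: a ResNet backbone for per-frame spatial features, an attention stage, a GRU for temporal features, and a fully connected classifier. The single-layer attention scorer, feature dimensions, clip length, and class count are illustrative assumptions; the paper's actual hybrid attention module is not reproduced here.

```python
import torch
import torch.nn as nn
from torchvision import models

class FERCascade(nn.Module):
    """Sketch of the cascade: ResNet spatial features -> attention -> GRU -> FC."""
    def __init__(self, num_classes=7, feat_dim=512, hidden_dim=256):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Drop the classification head; keep the 512-d pooled spatial features.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        # Hypothetical stand-in for the paper's hybrid attention module:
        # a learned scoring vector that weights each frame's features.
        self.attn = nn.Linear(feat_dim, 1)
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):
        # frames: (batch, time, 3, 224, 224) aligned face crops
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1)).flatten(1)  # (b*t, feat_dim)
        x = x.view(b, t, -1)                                # (b, t, feat_dim)
        w = torch.softmax(self.attn(x), dim=1)              # per-frame attention weights
        x = x * w                                           # attention-reweighted features
        out, _ = self.gru(x)                                # temporal features
        return self.fc(out[:, -1])                          # classify from last hidden state

logits = FERCascade()(torch.randn(2, 16, 3, 224, 224))  # 2 clips of 16 frames each
print(logits.shape)  # torch.Size([2, 7])
```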

Highlights

  • The average classification accuracies obtained in the experiments on the CK+, AFEW, and Oulu-CASIA datasets are shown in Tables 2–4, respectively

  • On the AFEW dataset, the data are collected in natural environments and are subject to head deflection, uneven illumination, and blur

  • Among state-of-the-art methods, DenseNet-161 [24] achieves an accuracy of 51.40%, which is 2.04% lower than our method, showing that our method outperforms others for facial expression recognition in natural environments

Introduction

Automatic facial expression recognition (FER) has significant potential to improve human–computer interaction. Traditional expression recognition methods, such as principal component analysis (PCA) [4,5], Gabor wavelets [6], and local binary patterns (LBP) [7], operate on static images. Because these methods consider only the expression at its peak and ignore the influence of dynamic changes [8], researchers have gradually shifted their focus from static-image recognition to dynamic video-sequence recognition [9]. A minimal sketch of one such static descriptor appears below.
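
As an illustration of the static descriptors mentioned above, the following sketch computes a uniform LBP histogram for a single face crop using scikit-image's `local_binary_pattern`. The 48×48 crop size and the (P=8, R=1) uniform-pattern settings are illustrative choices, not the configurations of the cited papers.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_face, points=8, radius=1):
    """Uniform LBP histogram of a grayscale face crop (a classic static descriptor)."""
    codes = local_binary_pattern(gray_face, points, radius, method="uniform")
    # The "uniform" method with P sampling points yields P + 2 distinct code values.
    hist, _ = np.histogram(codes, bins=points + 2, range=(0, points + 2), density=True)
    return hist

face = np.random.randint(0, 256, (48, 48), dtype=np.uint8)  # stand-in for an aligned face crop
print(lbp_histogram(face).shape)  # (10,)
```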
