Abstract

The performance of a facial expression recognition network degrades noticeably under uneven illumination or partial facial occlusion, because it is difficult to focus attention precisely on the dynamically changing regions of the face (e.g., eyes, nose, and mouth). To address this issue, this paper combines the attention mechanism with pyramid features and proposes a cascade attention-based facial expression recognition network that fuses (i) local spatial features, (ii) multi-scale-stereoscopic spatial context features (extracted from a 3-scale feature pyramid), and (iii) temporal features. Experiments on the CK+, Oulu-CASIA, and RAF-DB datasets yielded recognition accuracies of 99.23%, 89.29%, and 86.80%, respectively. These results indicate that the proposed method outperforms state-of-the-art methods in both experimental and natural environments.

Highlights

  • Human facial expression is one of the most natural and universal physiological signals by which humans can convey their feelings and behavioral trends

  • The mainstream methods of static facial expression recognition include traditional hand-crafted feature methods such as LBP [3] and SIFT [4]; with hand-crafted descriptors, these methods struggle to extract the powerful temporal features hidden in facial images

  • We propose a novel attention aggregation method that performs feature-weighted aggregation of local and multi-scale-stereoscopic spatial context features, focusing on the regions that contribute most to facial expression recognition, and we investigate the efficiency of single and cascaded attention blocks for feature aggregation
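As a rough illustration of what feature-weighted aggregation with single and cascaded attention blocks can look like, the following minimal sketch pools a set of feature vectors using softmax attention weights. It is not the network described in this paper: the function names (`attention_aggregate`, `cascade_aggregate`), the dot-product scoring, and the choice to feed each block's output to the next block as its query are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_aggregate(features, query):
    """Single attention block (hypothetical): weighted sum of feature vectors.

    Each feature vector is scored by its dot product with `query`;
    the softmax of the scores gives the aggregation weights.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights


def cascade_aggregate(features, query, num_blocks=2):
    """Cascaded attention (assumed design): the fused output of one block
    becomes the query of the next block."""
    fused, weights = attention_aggregate(features, query)
    for _ in range(num_blocks - 1):
        fused, weights = attention_aggregate(features, fused)
    return fused, weights


# Toy input: three 2-D feature vectors standing in for local and
# multi-scale context features (values are made up).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]

single_fused, single_weights = attention_aggregate(features, query)
cascade_fused, cascade_weights = cascade_aggregate(features, query, num_blocks=2)
```

In this toy setup the features that align with the query receive larger weights, which mirrors the idea of emphasizing regions that contribute more to recognition; a real network would learn the scoring function rather than use a fixed dot product.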


Summary

Introduction

Human facial expression is one of the most natural and universal physiological signals by which humans can convey their feelings and behavioral trends. The mainstream methods of static facial expression recognition include traditional hand-crafted feature methods such as LBP [3] and SIFT [4]; with hand-crafted descriptors, these methods struggle to extract the powerful temporal features hidden in facial images. Because a facial expression in a video sequence is a dynamic process, many studies employ dynamic methods to learn face image features while incorporating face networks to extract the temporal and spatial features of facial expression images [5]. The accuracy of facial expression recognition in video sequences is still affected by lighting, deflection, occlusion, and other objective factors that degrade image quality [8]. A variety of facial expression recognition methods [9,10,11] learn facial expression features by eliminating the interference caused by such factors.

