Abstract

Facial expression recognition (FER) has a wide variety of applications, ranging from human–computer interaction and robotics to health care. Although FER has made significant progress with the success of convolutional neural networks (CNNs), it remains challenging, especially for video-based FER, because of the dynamic changes in facial actions. Because specific divergences exist among different expressions, we introduce a metric learning framework with a siamese cascaded structure that learns fine-grained distinctions between expressions in the video-based task. We also develop a pairwise sampling strategy for this metric learning framework. Furthermore, we propose a novel action-units attention mechanism, tailored to the FER task, that extracts spatial context from the emotion regions. This mechanism works in a sparse self-attention fashion, enabling a single feature at any position to perceive the features of the action-unit (AU) parts (eyebrows, eyes, nose, and mouth). In addition, an attentive pooling module is designed to select informative items over the video sequence by capturing their temporal importance. We conduct experiments on four widely used datasets (CK+, Oulu-CASIA, MMI, and AffectNet), and also on the in-the-wild dataset AFEW to further investigate the robustness of the proposed method. The results demonstrate that our approach outperforms existing state-of-the-art methods. We also provide an ablation study of each component.
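
The idea of attentive pooling over a video sequence can be illustrated with a minimal sketch. It is not the paper's module: it assumes that each frame is already encoded as a feature vector and that temporal importance is a softmax over per-frame scores. The function name `attentive_pool` and the scoring vector `w` are hypothetical and introduced only for illustration.

```python
import math


def attentive_pool(frames, w):
    """Pool per-frame feature vectors into one vector with softmax weights.

    frames: list of T feature vectors, each a list of D floats.
    w: hypothetical scoring vector of length D (an assumption, standing in
       for whatever learned parameters a real module would use).
    """
    # Score each frame with a dot product against the scoring vector.
    scores = [sum(f_i * w_i for f_i, w_i in zip(f, w)) for f in frames]

    # Numerically stable softmax over the time axis gives temporal weights.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The pooled feature is the weighted sum of the frame features.
    dim = len(frames[0])
    pooled = [
        sum(weights[t] * frames[t][d] for t in range(len(frames)))
        for d in range(dim)
    ]
    return pooled, weights


# Toy sequence of three frames with two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [0.5, 0.5]
pooled, weights = attentive_pool(frames, w)
```

In this toy input the third frame has the highest score, so it receives the largest weight and contributes most to the pooled feature. Frames with low scores are down-weighted rather than discarded.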
