MCF-Net: Fusion Network of Facial and Scene Features for Expression Recognition in the Wild

Hui Xu,Xiangqin Kong,Jianzhong Wang,Jun Kong,Juan Li

doi:10.3390/app122010251

Abstract

Nowadays, the facial expression recognition (FER) task has transitioned from a laboratory-controlled scenario to in-the-wild conditions. However, recognizing facial expressions in the wild is challenging due to factors such as variant backgrounds, low-quality facial images, and the subjectiveness of annotators. Therefore, deep neural networks have increasingly been leveraged to learn discriminative representations for FER. In this work, we propose the Multi-cues Fusion Net (MCF-Net), a novel deep learning model with a two-stream structure for FER. Our model first proposes a two-stream coding network to extract face and scene representations. Then, an adaptive fusion module is employed to fuse the two different representations for final recognition. In the face coding stream, a Sparse Mask Attention Learning (SMAL) module is utilized to adaptively generate the corresponding sparse face mask according to the input image. Meanwhile, we employ a Multi-scale Attention (MSA) module for extracting fine-grained feature subsets, which can obtain richer multi-scale interaction information. In the scene coding stream, a Relational Attention (RA) module is applied to construct the hidden relationship between the face and contextual features of non-facial regions by capturing the pairwise similarity. In order to verify the effectiveness and accuracy of our model, a large number of experiments are carried out on two public large-scale static facial expression image datasets, CAER-S and NCAER-S. By comparing the performance of our MCF-Net with other methods, the proposed model achieves superior results on two in-the-wild FER benchmarks: CAER-S with an accuracy of 81.82% and NCAER-S with an accuracy of 45.59%.

Full Text