Abstract

Facial Expression Recognition (FER) has achieved remarkable progress driven by Convolutional Neural Networks (CNNs). Because convolutional filters rely on spatial locality, however, they fail to learn long-range dependencies between different facial regions in most layers, so the performance of CNN-based FER models remains limited. To address this problem, this paper introduces a novel FER framework with two attention mechanisms for CNN-based models, applied to low-level feature learning and high-level semantic representation, respectively. In low-level feature learning, a grid-wise attention mechanism is proposed to capture dependencies between different regions of a facial expression image, so that the parameter updates of the convolutional filters are regularized. In high-level semantic representation, a visual transformer attention mechanism uses a sequence of visual semantic tokens, generated from the pyramid features of the upper convolutional blocks, to learn a global representation. Extensive experiments have been conducted on three public facial expression datasets: CK+, FER+, and RAF-DB. The results show that the proposed FER-VT achieves state-of-the-art performance on these datasets, including 100% accuracy on the CK+ dataset without any extra training data.
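
To make the two attention stages concrete, here is a minimal PyTorch sketch of the idea. It is an illustrative assumption rather than the paper's exact architecture: the module names (GridWiseAttention, TokenTransformerHead), the grid size, the average pooling used to form tokens, and all hyper-parameters (embedding dimension, head counts, the 7-class output) are hypothetical choices for demonstration.

```python
# Hedged sketch of the abstract's two attention stages. All shapes,
# names, and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class GridWiseAttention(nn.Module):
    """Toy grid-wise attention over low-level feature maps.

    Pools the feature map into an S x S grid of cell descriptors and
    lets every cell attend to every other cell, so distant facial
    regions can influence each other's low-level features.
    """
    def __init__(self, channels: int, grid_size: int = 4):
        super().__init__()
        self.grid_size = grid_size
        self.attn = nn.MultiheadAttention(embed_dim=channels, num_heads=4,
                                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H and W divisible by grid_size
        b, c, h, w = x.shape
        s = self.grid_size
        # One token per grid cell: (B, S*S, C)
        tokens = nn.functional.adaptive_avg_pool2d(x, (s, s))
        tokens = tokens.flatten(2).transpose(1, 2)
        attended, _ = self.attn(tokens, tokens, tokens)
        # Broadcast the attended cell descriptors back over the map
        gate = attended.transpose(1, 2).reshape(b, c, s, s)
        gate = nn.functional.interpolate(gate, size=(h, w), mode="nearest")
        return x * torch.sigmoid(gate)

class TokenTransformerHead(nn.Module):
    """Toy visual-transformer head over semantic tokens pooled from
    pyramid features of the upper convolutional blocks."""
    def __init__(self, dims, embed_dim: int = 256, num_classes: int = 7):
        super().__init__()
        # One linear projection per pyramid level to a shared token dim
        self.proj = nn.ModuleList([nn.Linear(d, embed_dim) for d in dims])
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Linear(embed_dim, num_classes)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps from high conv blocks
        tokens = [p(f.mean(dim=(2, 3))) for p, f in zip(self.proj, feats)]
        seq = torch.stack(tokens, dim=1)       # (B, num_levels, D)
        encoded = self.encoder(seq)            # global token interactions
        return self.cls(encoded.mean(dim=1))   # pooled global representation

if __name__ == "__main__":
    x = torch.randn(2, 64, 32, 32)
    low = GridWiseAttention(64)(x)             # regularized low-level maps
    feats = [torch.randn(2, 128, 8, 8), torch.randn(2, 256, 4, 4)]
    logits = TokenTransformerHead([128, 256])(feats)
    print(low.shape, logits.shape)             # (2, 64, 32, 32) (2, 7)
```

In this sketch the grid-wise stage acts as a multiplicative gate on the low-level feature maps, while the transformer head mixes per-level tokens globally; the actual FER-VT design may differ in how tokens are formed and fused.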
