Abstract

One key challenge in facial expression recognition (FER) is the extraction of discriminative features from critical facial regions. Because of their promising ability to learn discriminative features, visual attention mechanisms are increasingly used to address pattern recognition problems. This paper presents a novel multiple attention network that simulates humans’ coarse-to-fine visual attention to improve expression recognition performance. In the proposed network, a region-aware sub-net (RASnet) learns binary masks for locating expression-related critical regions at coarse-to-fine granularity levels, and an expression recognition sub-net (ERSnet) with a multiple attention (MA) block learns comprehensive discriminative features. Embedded in the convolutional layers, the MA block fuses diversified attention using the masks learned by the RASnet. The MA block contains a hybrid attention branch with a series of sub-branches, each providing region-specific attention. To explore the complementary benefits of diversified attention, the MA block also has a weight learning branch that adaptively learns the contributions of the different critical regions. Experiments were carried out on two publicly available databases, RAF and CK+, yielding accuracies of 85.69% and 96.28%, respectively. The results indicate that our method achieves competitive or better performance than state-of-the-art methods.
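
As a rough illustration of the architecture described above, the following is a minimal PyTorch-style sketch of how such an MA block could combine RASnet masks with convolutional features. The class name MABlock, the sigmoid spatial gating inside each region sub-branch, the global-pooling-plus-softmax weight-learning branch, and the residual fusion are all illustrative assumptions, not the authors' exact design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MABlock(nn.Module):
        """Hypothetical multiple attention block: one sub-branch per critical region."""

        def __init__(self, channels, num_regions):
            super().__init__()
            # Hybrid attention branch: one region-specific sub-branch per RASnet mask.
            self.sub_branches = nn.ModuleList([
                nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())
                for _ in range(num_regions)
            ])
            # Weight-learning branch: predicts one contribution score per region.
            self.weight_branch = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, num_regions)
            )

        def forward(self, feat, masks):
            # feat:  (B, C, H, W) feature maps from the preceding convolutional layer
            # masks: (B, R, H, W) binary region masks produced by the RASnet
            region_feats = []
            for r, branch in enumerate(self.sub_branches):
                masked = feat * masks[:, r:r + 1]        # restrict features to one critical region
                attn = branch(masked)                    # region-specific spatial attention map
                region_feats.append(feat * attn)         # attended features for this region
            stacked = torch.stack(region_feats, dim=1)   # (B, R, C, H, W)
            # Adaptively weight each region's contribution and fuse.
            weights = F.softmax(self.weight_branch(feat), dim=1)            # (B, R)
            fused = (stacked * weights[:, :, None, None, None]).sum(dim=1)  # (B, C, H, W)
            return fused + feat                          # fuse with the original features

Under these assumptions, the block would be dropped in after a convolutional stage of the ERSnet, e.g. MABlock(256, num_regions=4)(feat, masks), with the masks supplied by the RASnet at its coarse-to-fine granularity levels.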

Highlights

  • Expression, a common form of nonverbal communication, conveys important cues for emotional states and intentions

  • To learn discriminative features, our convolutional neural network (CNN) mimics visual attention that accounts for the diversity of expressions and individuals

  • The RAF database’s 29,672 images are divided into single-label and two-tab subsets

Introduction

Expression, a common form of nonverbal communication, conveys important cues about emotional states and intentions. Human observers can pay selective attention to the expression-related parts of a facial image while screening out irrelevant components, resulting in high-level FER performance. Several recent FER studies [8]–[10] used deep networks to mimic this attention mechanism and achieved excellent FER performance. Two of these studies [8], [9] adopted single-level (i.e., global-level) attention without considering diversified saliencies, which may divert attention to expression-irrelevant components. Li et al. [10] adopted region-level attention to examine the importance of different regions, but this method cannot learn comprehensive discriminative features across regions.
