Abstract

Facial expression recognition (FER) in the wild is challenging due to disturbing factors such as pose variation, occlusion, and illumination variation. Attention mechanisms can alleviate these issues by enhancing expression-relevant information and suppressing expression-irrelevant information. However, most methods apply the same attention mechanism to feature tensors whose spatial and channel dimensions vary across network layers, disregarding the dynamically changing sizes of these tensors. To address this issue, this paper proposes a hierarchical attention network with progressive feature fusion for FER. Specifically, first, to aggregate diverse complementary features, a diverse feature extraction module based on several feature aggregation blocks is designed to exploit both local and global context features, both low-level and high-level features, as well as gradient features that are robust to illumination variation. Second, to effectively fuse these diverse features, a hierarchical attention module (HAM) is designed to progressively enhance discriminative features from key parts of the facial images and suppress task-irrelevant features from disturbing facial regions. Extensive experiments show that our model outperforms existing FER methods.
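
To illustrate the general idea of stage-adaptive attention with progressive fusion described above, the following is a minimal PyTorch-style sketch, not the authors' released code: it assumes a backbone producing multi-stage features, and all module names, channel widths, and hyperparameters are illustrative assumptions.

```python
# Illustrative sketch (assumed design, not the paper's official implementation):
# attention sized to each stage's tensor, followed by shallow-to-deep fusion.
import torch
import torch.nn as nn


class StageAttention(nn.Module):
    """Channel + spatial attention sized to one stage's feature tensor."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: squeeze spatial dims, excite per channel.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention: a single-channel map over H x W.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x * self.channel_gate(x)   # suppress expression-irrelevant channels
        x = x * self.spatial_gate(x)   # emphasize key facial regions
        return x


class ProgressiveFusion(nn.Module):
    """Fuse attended features stage by stage, from shallow to deep."""

    def __init__(self, stage_channels: list[int]):
        super().__init__()
        self.attn = nn.ModuleList([StageAttention(c) for c in stage_channels])
        # 1x1 convolutions project each fused map to the next stage's width.
        self.proj = nn.ModuleList([
            nn.Conv2d(c_in, c_out, kernel_size=1)
            for c_in, c_out in zip(stage_channels[:-1], stage_channels[1:])
        ])

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        fused = self.attn[0](feats[0])
        for i in range(1, len(feats)):
            # Downsample the running fusion to match the next stage's size.
            fused = nn.functional.adaptive_avg_pool2d(fused, feats[i].shape[-2:])
            fused = self.proj[i - 1](fused) + self.attn[i](feats[i])
        return fused


# Usage: four hypothetical backbone stages with ResNet-like channel widths.
feats = [torch.randn(2, c, s, s) for c, s in [(64, 56), (128, 28), (256, 14), (512, 7)]]
out = ProgressiveFusion([64, 128, 256, 512])(feats)
print(out.shape)  # torch.Size([2, 512, 7, 7])
```

The key point the sketch tries to capture is that each stage gets attention parameterized for its own channel and spatial size, and the attended features are accumulated progressively rather than fused in a single step.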
