Abstract

Existing facial expression recognition methods rarely explore the complex spatiotemporal dependencies among facial regions at different scales. This paper proposes a transformer-based three-layer hierarchical architecture that incorporates multi-scale spatiotemporal aggregation for dynamic facial expression recognition. The hierarchy comprises three layers, from bottom to top, each built from transformer encoders with local self-attention mechanisms. These encoders gradually expand their receptive fields through hierarchical spatiotemporal aggregation, enabling the modeling of spatiotemporal context dependencies among facial regions at different scales and across consecutive frames. Consequently, the bottom, middle, and top layers learn fine-grained, coarse-grained, and global facial representations, respectively. To evaluate the proposed framework, we conducted extensive experiments on four public datasets. The comparison results demonstrate that the framework outperforms the state of the art, with accuracies of 79.09%, 62.19%, 64.85%, and 59.79% on the RML, eNTERFACE'05, RAVDESS, and AFEW datasets, respectively. Ablation experiments, statistical significance tests, and visualization analyses indicate that the proposed framework successfully learns emotion-salient facial representations.
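The abstract does not specify implementation details, but the hierarchy it describes can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: it assumes patch tokens from video frames as input, non-overlapping windows as a stand-in for the paper's local self-attention encoders, and average pooling as the spatiotemporal aggregation step. All module names, dimensions, and window sizes (`HierarchicalStage`, `dim=256`, `window=8`, etc.) are hypothetical.

```python
# Minimal sketch (assumed details, not the paper's exact architecture):
# three stacked stages of windowed (local) self-attention, each followed by
# token pooling that enlarges the receptive field of the next stage.
import torch
import torch.nn as nn


class HierarchicalStage(nn.Module):
    """One hierarchy layer: a transformer encoder applied within local
    windows of spatiotemporal tokens, then aggregation (average pooling)
    that halves the token count, so the next stage's fixed-size windows
    span a larger spatiotemporal extent."""

    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)

    def forward(self, x):                       # x: (B, N, D) token sequence
        B, N, D = x.shape
        w = self.window
        assert N % w == 0, "token count must be divisible by the window size"
        # Local self-attention: attend only within non-overlapping windows.
        x = x.reshape(B * N // w, w, D)
        x = self.encoder(x)
        x = x.reshape(B, N, D)
        # Spatiotemporal aggregation: merge neighboring tokens.
        x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # (B, N // 2, D)
        return x


class HierarchicalDFER(nn.Module):
    """Bottom-to-top stages (fine-grained -> coarse-grained -> global),
    followed by a classification head over the pooled representation."""

    def __init__(self, dim=256, num_classes=7):
        super().__init__()
        self.stages = nn.ModuleList([
            HierarchicalStage(dim, num_heads=8, window=8),  # fine-grained
            HierarchicalStage(dim, num_heads=8, window=8),  # coarse-grained
            HierarchicalStage(dim, num_heads=8, window=8),  # global
        ])
        self.head = nn.Linear(dim, num_classes)

    def forward(self, tokens):                  # tokens: (B, N, D) patch
        for stage in self.stages:               # features from a video clip
            tokens = stage(tokens)
        return self.head(tokens.mean(dim=1))    # pool remaining tokens


if __name__ == "__main__":
    clip_tokens = torch.randn(2, 64, 256)       # 2 clips, 64 tokens each
    logits = HierarchicalDFER()(clip_tokens)
    print(logits.shape)                          # torch.Size([2, 7])
```

In this sketch, each stage halves the token count, so the same window size covers progressively larger facial regions and temporal spans, mirroring the fine-grained-to-global progression the abstract describes.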
