Abstract

Unlike conventional video action recognition, Dynamic Facial Expression Recognition (DFER) tasks exhibit minimal spatial movement of objects. Addressing this distinctive attribute, we propose an innovative CNN-Transformer model, named LSGTNet, tailored specifically for DFER tasks. LSGTNet comprises three stages, each composed of a spatial CNN (Spa-CNN) followed by a temporal transformer (T-Former). The Spa-CNN extracts spatial features from images, yielding smaller feature maps that reduce the computational cost of the subsequent T-Former. The T-Former integrates global temporal information across frames at each spatial position while preserving the feature-map dimensions. The alternating interplay between Spa-CNN and T-Former ensures a continuous fusion of spatial and temporal information, enabling our model to excel across various real-world datasets. To the best of our knowledge, this is the first method to address the DFER challenge by focusing on capturing the temporal changes of muscles within local spatial regions. Our method achieves state-of-the-art results on multiple in-the-wild datasets as well as datasets collected under laboratory conditions.
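The staged design described above can be illustrated with a minimal sketch. This is not the authors' implementation: average pooling stands in for the Spa-CNN's spatial feature extraction, and a single-head attention without learned projections stands in for the T-Former; all shapes and stage counts are illustrative assumptions. It only shows the structural idea: each stage shrinks the spatial resolution, then attends over the time axis at each spatial position, leaving the feature-map size unchanged.

```python
import numpy as np

def spa_cnn(x, pool=2):
    """Stand-in for the Spa-CNN stage: 2x2 average pooling halves H and W,
    mimicking how the spatial CNN yields smaller feature maps."""
    t, h, w, c = x.shape
    x = x[:, :h // pool * pool, :w // pool * pool, :]
    return x.reshape(t, h // pool, pool, w // pool, pool, c).mean(axis=(2, 4))

def t_former(x):
    """Stand-in for the T-Former stage: self-attention over the T frames at
    each spatial position independently; feature-map dimensions are kept."""
    t, h, w, c = x.shape
    seq = x.reshape(t, h * w, c).transpose(1, 0, 2)      # (HW, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(c)   # (HW, T, T)
    scores -= scores.max(axis=-1, keepdims=True)         # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)             # softmax over time
    out = attn @ seq                                     # (HW, T, C)
    return out.transpose(1, 0, 2).reshape(t, h, w, c)

# Three alternating stages, as in the abstract: spatial size shrinks
# (32 -> 16 -> 8 -> 4) while the temporal length and channels are preserved.
x = np.random.randn(16, 32, 32, 8)   # (frames, height, width, channels)
for _ in range(3):
    x = t_former(spa_cnn(x))
print(x.shape)  # (16, 4, 4, 8)
```

Attending only over the time axis at a fixed spatial position is what keeps the transformer's cost low and matches the paper's premise that facial videos have minimal spatial movement, so the informative signal is the temporal change of local muscle regions.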
