Abstract

The facial action units (FAUs) defined by the Facial Action Coding System (FACS) have become an important basis for facial expression analysis. Most work on FAU detection considers only spatial-temporal features and ignores label-wise AU correlations. In practice, the strong relationships among facial AUs can aid AU detection. We propose a transformer-based FAU detection model that leverages both local spatial-temporal features and label-wise FAU correlations. Specifically, we first design a visual spatial-temporal transformer-based model and a convolution-based audio model to extract AU-specific features. Second, inspired by the relationships among FAUs, we propose a transformer-based correlation module to learn correlations between AUs. The AU-specific features from the aural and visual models are further aggregated in the correlation module to produce per-frame predictions of 12 AUs. Our model was trained on the Aff-Wild2 dataset of the ABAW3 challenge and achieved state-of-the-art performance on the FAU task, which verifies the effectiveness of the proposed network.
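As a rough illustration of the aggregation step described above, the sketch below shows one plausible form of a transformer-based AU correlation module: per-AU visual and aural feature vectors are fused into a 12-token sequence, so self-attention can model pairwise AU relationships before a per-token classifier emits the 12 per-frame logits. The module name, dimensions, and additive fusion are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch (not the authors' code): a transformer encoder over
# 12 AU-specific tokens, modeling label-wise AU correlation via self-attention.
import torch
import torch.nn as nn

class AUCorrelationModule(nn.Module):
    def __init__(self, num_aus: int = 12, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        # Learnable per-AU embedding added to the fused features (assumed design).
        self.au_embed = nn.Parameter(torch.zeros(1, num_aus, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.classifier = nn.Linear(dim, 1)  # one logit per AU token

    def forward(self, visual_feats, audio_feats):
        # visual_feats, audio_feats: (batch, num_aus, dim) AU-specific features
        # from the spatial-temporal visual model and the convolutional audio model.
        tokens = visual_feats + audio_feats + self.au_embed  # additive fusion (assumption)
        tokens = self.encoder(tokens)  # self-attention captures AU-AU correlation
        return self.classifier(tokens).squeeze(-1)  # (batch, num_aus) per-frame logits

# Usage: per-frame logits for 12 AUs from fused modality features.
module = AUCorrelationModule()
v = torch.randn(8, 12, 256)  # visual AU-specific features for 8 frames
a = torch.randn(8, 12, 256)  # aural AU-specific features for the same frames
logits = module(v, a)        # shape (8, 12)
```

Treating each AU as a token is one natural way to let attention weights express label-wise correlation; the paper's actual fusion and head design may differ.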
