Abstract

The facial action units (FAUs) defined by the Facial Action Coding System (FACS) have become an important basis for facial expression analysis. Most work on FAU detection considers only spatial-temporal features and ignores label-wise AU correlations. In practice, the strong relationships among facial AUs can aid AU detection. We propose a transformer-based FAU detection model that leverages both local spatial-temporal features and label-wise FAU correlations. Specifically, we first design a visual spatial-temporal transformer-based model and a convolution-based audio model to extract AU-specific features. Second, inspired by the relationships among FAUs, we propose a transformer-based correlation module to learn correlations between AUs. The AU-specific features from the aural and visual models are further aggregated in the correlation module to produce per-frame predictions of 12 AUs. Our model was trained on the Aff-Wild2 dataset of the ABAW3 challenge and achieved state-of-the-art performance on the FAU task, which verifies the effectiveness of the proposed network.
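As a rough illustration of the aggregation step described above, the sketch below shows one plausible form of a transformer-based AU correlation module: per-AU visual and aural feature vectors are fused into a 12-token sequence, so self-attention can model pairwise AU relationships before a per-token classifier emits the 12 per-frame logits. The module name, dimensions, and additive fusion are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch (not the authors' code): a transformer encoder over
# 12 AU-specific tokens, modeling label-wise AU correlation via self-attention.
import torch
import torch.nn as nn

class AUCorrelationModule(nn.Module):
    def __init__(self, num_aus: int = 12, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        # Learnable per-AU embedding added to the fused features (assumed design).
        self.au_embed = nn.Parameter(torch.zeros(1, num_aus, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.classifier = nn.Linear(dim, 1)  # one logit per AU token

    def forward(self, visual_feats, audio_feats):
        # visual_feats, audio_feats: (batch, num_aus, dim) AU-specific features
        # from the spatial-temporal visual model and the convolutional audio model.
        tokens = visual_feats + audio_feats + self.au_embed  # additive fusion (assumption)
        tokens = self.encoder(tokens)  # self-attention captures AU-AU correlation
        return self.classifier(tokens).squeeze(-1)  # (batch, num_aus) per-frame logits

# Usage: per-frame logits for 12 AUs from fused modality features.
module = AUCorrelationModule()
v = torch.randn(8, 12, 256)  # visual AU-specific features for 8 frames
a = torch.randn(8, 12, 256)  # aural AU-specific features for the same frames
logits = module(v, a)        # shape (8, 12)
```

Treating each AU as a token is one natural way to let attention weights express label-wise correlation; the paper's actual fusion and head design may differ.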
