Abstract
Facial action unit (AU) detection is challenging because facial behavior is varied and subtle across individuals. Two characteristics of facial muscle activity, temporal dependencies and correlations among actions, distinguish AU detection from general multi-label classification, and capturing both is the key to accurate AU detection. However, little work to date considers the two jointly. To capture AU correlations within an image, we first disentangle the global (image) feature into multiple AU-specific features using an AU contrastive loss, and then compute the feature for each AU by aggregating the features of the other AUs with a self-attention-based transformer. Unlike the original transformer, ours embeds the AU semantic dependency matrix to weakly guide attention learning. We then obtain the frame-wise features through a weighted fusion of the AU-wise features. We further capture the temporal dependencies among frames with another attention-based transformer, which aggregates information from prior frames. Extensive experiments on two benchmark datasets (BP4D and DISFA) demonstrate that the proposed framework outperforms state-of-the-art approaches.
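The three stages described above (AU-wise attention guided by a dependency matrix, weighted fusion into a frame feature, and temporal attention over prior frames) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and the abstract does not specify these details, so the following are assumptions made only for illustration: the dependency matrix is added as a bias to the attention logits, fusion is a softmax-weighted sum, the temporal attention uses a causal mask that lets each frame see itself and earlier frames, and learned projections, multiple heads, and the contrastive loss are omitted. All function names, parameter names, and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def au_attention(au_feats, dep_matrix, bias_scale=1.0):
    """Self-attention across AU-specific features of one frame.

    au_feats:   (num_aus, dim) AU-specific features.
    dep_matrix: (num_aus, num_aus) AU dependency prior.
    ASSUMPTION: the prior enters as an additive bias on the logits.
    """
    dim = au_feats.shape[-1]
    logits = au_feats @ au_feats.T / np.sqrt(dim)
    logits = logits + bias_scale * dep_matrix
    weights = softmax(logits, axis=-1)
    return weights @ au_feats, weights


def fuse_aus(au_feats, fusion_logits):
    """Weighted fusion of AU-wise features into one frame feature.

    ASSUMPTION: the weights are a softmax over one logit per AU.
    """
    w = softmax(fusion_logits)
    return w @ au_feats


def causal_temporal_attention(frame_feats):
    """Attention over frames where frame t only sees frames <= t.

    ASSUMPTION: a causal mask stands in for "prior frames".
    """
    num_frames, dim = frame_feats.shape
    logits = frame_feats @ frame_feats.T / np.sqrt(dim)
    future = np.triu(np.ones((num_frames, num_frames), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)
    weights = softmax(logits, axis=-1)
    return weights @ frame_feats, weights


# Toy usage with random inputs; all sizes are arbitrary.
rng = np.random.default_rng(0)
num_frames, num_aus, dim = 4, 12, 8
dep = rng.uniform(size=(num_aus, num_aus))  # placeholder prior
fusion_logits = np.zeros(num_aus)           # uniform fusion weights

frames = []
for _ in range(num_frames):
    au_feats = rng.normal(size=(num_aus, dim))
    attended, _ = au_attention(au_feats, dep)
    frames.append(fuse_aus(attended, fusion_logits))

frame_feats = np.stack(frames)
temporal_out, temporal_w = causal_temporal_attention(frame_feats)
print(temporal_out.shape)  # (4, 8)
```

Using an additive bias keeps the guidance "weak" in the sense that the data-driven logits can still override the prior; other ways of embedding the matrix (for example, as a multiplicative gate) would be equally consistent with the abstract.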