Abstract

Facial expression recognition (FER) in the wild is extremely challenging due to occlusions, varying head poses under unconstrained conditions, and incorrect annotations (i.e., label noise). In this paper, we aim to improve the performance of in-the-wild FER with Transformers and online label correction. In contrast to purely CNN-based methods, we propose a Transformer-augmented network (TAN) to dynamically capture the relationships both within each facial patch and across facial patches. Specifically, the TAN first translates a set of facial patch images into visual feature sequences via a backbone convolutional neural network. An intra-patch Transformer is then applied to capture the most discriminative features within each visual feature sequence. We further propose a position-disentangled attention mechanism for the intra-patch Transformer to better incorporate positional information into the feature sequences. In addition, we propose an inter-patch Transformer to model the dependencies across these feature sequences. More importantly, we present an online label correction (OLC) framework that corrects suspicious hard labels and accumulates soft labels based on the model's predictions, which strengthens the robustness of our model against label noise. We validate our method on several widely used datasets (RAF-DB, FERPlus, AffectNet), on realistic occlusion and pose-variation datasets, and on synthetic noisy datasets. Extensive experiments on these benchmarks demonstrate that the proposed method performs favorably against state-of-the-art methods. The source code will be made publicly available.
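The online label correction idea sketched in the abstract can be illustrated with a minimal per-sample update: accumulate soft labels from model predictions with an exponential moving average, and flip a suspicious hard label once the accumulated soft label disagrees with it confidently. The momentum and confidence threshold below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def olc_update(soft_label, prediction, hard_label,
               momentum=0.9, threshold=0.7):
    """One OLC step for a single sample (illustrative sketch).

    soft_label : accumulated class distribution for this sample
    prediction : current model softmax output for this sample
    hard_label : current (possibly noisy) integer class label
    """
    # Accumulate evidence from the model's predictions over epochs.
    soft_label = momentum * soft_label + (1.0 - momentum) * prediction
    top = int(np.argmax(soft_label))
    # Relabel only when the accumulated evidence is both confident
    # and inconsistent with the current hard label.
    if top != hard_label and soft_label[top] > threshold:
        hard_label = top
    return soft_label, hard_label
```

Under this scheme a one-hot noisy label is only overwritten after the running average of predictions consistently favors another class, which guards against relabeling from a single spurious prediction.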
