Existing facial expression recognition (FER) methods train encoders on different large-scale datasets for specific FER applications. In this paper, we propose a new task for this field: pre-training a general encoder that extracts facial expression representations for any FER application without fine-tuning. To tackle this task, we extend self-supervised contrastive learning to pre-train a general encoder for facial expression analysis. Specifically, given a batch of facial expressions, positive and negative pairs are first constructed based on coarse-grained labels and an FER-specific data augmentation strategy. Second, we propose coarse-contrastive (CRS-CONT) learning, in which the features of positive pairs are pulled together while being pushed away from the features of negative pairs. Moreover, one key issue is that an excessive constraint on the coarse-grained feature distribution would harm fine-grained FER applications. To address this, a weight vector is designed to control the optimization of CRS-CONT learning. As a result, the well-trained general encoder, with frozen weights, can adapt to different facial expressions and support linear evaluation on any target dataset. Extensive experiments on both in-the-wild and in-the-lab FER datasets show that our method achieves superior or comparable performance to state-of-the-art FER methods, especially on unseen facial expressions and in cross-dataset evaluation. We hope that this work will help reduce the training burden and offer a new alternative to fully-supervised feature learning with fine-grained labels. Code and the general encoder will be publicly available at https://github.com/hangyu94/CRS-CONT.
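To make the pull/push mechanism concrete, the sketch below shows a generic supervised-contrastive-style loss in which positives are defined by shared coarse-grained labels and a scalar `weight` stands in for the paper's weight vector that tempers the coarse-level constraint. This is a minimal illustration under those assumptions, not the exact CRS-CONT formulation from the paper.

```python
import torch
import torch.nn.functional as F

def coarse_contrastive_loss(features, coarse_labels, temperature=0.1, weight=0.5):
    """Sketch of a contrastive loss driven by coarse-grained labels.

    `weight` is a hypothetical scalar controlling how strongly the
    coarse-level constraint is enforced; the paper uses a weight vector.
    """
    # L2-normalize embeddings so dot products are cosine similarities.
    z = F.normalize(features, dim=1)                      # (N, d)
    sim = torch.matmul(z, z.T) / temperature              # (N, N)

    # Positives share the same coarse label; self-pairs are excluded.
    labels = coarse_labels.view(-1, 1)
    self_mask = torch.eye(len(z), device=z.device)
    pos_mask = (labels == labels.T).float() - self_mask

    # Log-softmax over non-self pairs: positives are pulled together,
    # all other samples act as negatives and are pushed away.
    logits = sim - 1e9 * self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average positive log-likelihood, down-weighted so the coarse-grained
    # constraint does not dominate fine-grained structure.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_count
    return weight * loss.mean()
```

In use, `features` would be the encoder outputs for a batch of face images and `coarse_labels` their coarse-grained expression labels; a smaller `weight` relaxes the constraint on the coarse feature distribution.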