Abstract

Facial expression recognition (FER) underpins many real-world human-computer interaction tasks, such as online learning, depression recognition and remote diagnosis. However, FER is often hindered by privacy concerns and low recognition accuracy, which stem from inadequate restrictions on data transfer to public clouds, an insufficient number of effectively labeled samples and class imbalance. To address these challenges, we develop an automatic, privacy-preserving learning state recognition system for supervising the quality of online teaching, in which edge servers and cloud servers cooperate to reduce the risk of privacy exposure. In particular, we propose few-shot facial expression recognition with a self-supervised vision transformer (SSF-ViT), which integrates self-supervised learning (SSL) and few-shot learning (FSL) to train a deep learning model with fewer labeled samples. Specifically, a vision transformer (ViT) is jointly pretrained on four self-supervised pretext tasks, namely image denoising and reconstruction, image rotation prediction, jigsaw puzzle solving and masked patch prediction, to obtain a pretrained ViT encoder. The pretrained ViT encoder is then fine-tuned on a lab-controlled, labeled FER dataset, where it extracts spatiotemporal features and performs the FER task. Finally, we construct prototypes for few-shot classification of specific expressions: an in-the-wild FER dataset is split into support and query sets to build few-shot classification episodes, the fine-tuned ViT encoder serves as the feature extractor to build a prototype for each support-set category, and expression classification results are obtained by computing the Euclidean distance between the query samples and the prototypes. Extensive experimental results show that SSF-ViT achieves recognition accuracies of 74.95%, 66.04%, 63.69% and 90.98% on the FER2013, AffectNet, SFEW 2.0 and RAF-DB datasets, respectively. In addition, SSF-ViT improves the recognition performance on specific expression categories within these datasets.
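
The prototype-based classification step described above matches the standard prototypical-network formulation: class prototypes are the mean support-set embeddings, and queries are assigned to the nearest prototype in Euclidean distance. The sketch below illustrates this step only; the function names (build_prototypes, classify_queries), the feature dimensionality and the random placeholder features are illustrative assumptions, not the authors' released code.

import numpy as np

def build_prototypes(support_embeddings, support_labels, num_classes):
    """Average the encoder embeddings of each support-set class to obtain
    one prototype vector per expression category."""
    dim = support_embeddings.shape[1]
    prototypes = np.zeros((num_classes, dim))
    for k in range(num_classes):
        prototypes[k] = support_embeddings[support_labels == k].mean(axis=0)
    return prototypes

def classify_queries(query_embeddings, prototypes):
    """Assign each query sample to the class whose prototype is nearest
    in Euclidean distance, as described in the abstract."""
    # Pairwise Euclidean distances, shape (num_queries, num_classes).
    dists = np.linalg.norm(
        query_embeddings[:, None, :] - prototypes[None, :, :], axis=-1
    )
    return dists.argmin(axis=1)

# Toy episode: 7 expression classes, 5 support shots per class, 128-d features
# standing in for the fine-tuned ViT encoder's outputs.
rng = np.random.default_rng(0)
support_x = rng.normal(size=(35, 128))
support_y = np.repeat(np.arange(7), 5)
query_x = rng.normal(size=(10, 128))
protos = build_prototypes(support_x, support_y, num_classes=7)
print(classify_queries(query_x, protos))

In an actual episode, support_x and query_x would be the features extracted by the fine-tuned ViT encoder rather than random vectors.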
