Abstract

The detection of facial muscle movements (e.g., mouth opening) is crucial for facial expression recognition (FER). However, extracting these facial motion features is challenging for a deep-learning recognition system for the following reasons: (1) without explicit motion labels for training, there is no guarantee that convolutional neural networks (CNNs) extract motion effectively; (2) compared to the motions in human action recognition (e.g., an object moving from left to right), some facial motions (e.g., raising eyebrows) are far more subtle and thus harder to extract; and (3) extracting motion features with optical flow is time-consuming for video captured by a standard camera. In this work, we propose a Multi-Scale Correlation Module (MSCM) together with an adaptive fusion module. First, both large and subtle facial motions are extracted by the MSCM and encoded by CNNs. An adaptive fusion module then aggregates the resulting motion features. With these modules, our recognition network models both subtle and large motion features for video-based FER using only RGB image frames as input. Experiments on two datasets, AFEW and DFEW, show that the network achieves state-of-the-art performance on both benchmarks.
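To give a concrete picture of the two components, the sketch below shows one plausible PyTorch realization of a multi-scale correlation module and an adaptive fusion over its outputs. This is a minimal illustration under our own assumptions, not the authors' implementation: the scale set, the displacement radius `max_disp`, the gating design, and all class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleCorrelation(nn.Module):
    """Hypothetical sketch of a multi-scale correlation module.

    Correlates CNN features of consecutive frames at several spatial
    scales, so that both large and subtle motions leave a signal.
    All hyper-parameters here are illustrative.
    """

    def __init__(self, scales=(1, 2), max_disp=3):
        super().__init__()
        self.scales = scales
        self.max_disp = max_disp  # search radius for displacements

    def correlate(self, f1, f2):
        # f1, f2: (B, C, H, W) features of frames t and t+1.
        d = self.max_disp
        f2 = F.pad(f2, (d, d, d, d))
        corrs = []
        for dy in range(2 * d + 1):
            for dx in range(2 * d + 1):
                shifted = f2[:, :, dy:dy + f1.size(2), dx:dx + f1.size(3)]
                # Similarity per displacement: (B, 1, H, W).
                corrs.append((f1 * shifted).mean(dim=1, keepdim=True))
        return torch.cat(corrs, dim=1)  # (B, (2d+1)^2, H, W)

    def forward(self, f1, f2):
        out = []
        for s in self.scales:
            a = F.avg_pool2d(f1, s) if s > 1 else f1
            b = F.avg_pool2d(f2, s) if s > 1 else f2
            c = self.correlate(a, b)
            # Upsample each correlation volume back to the input grid.
            out.append(F.interpolate(c, size=f1.shape[-2:],
                                     mode="bilinear", align_corners=False))
        return out  # one correlation volume per scale


class AdaptiveFusion(nn.Module):
    """Illustrative adaptive fusion: learns per-scale gates and
    aggregates the multi-scale motion features into one tensor."""

    def __init__(self, channels, num_scales):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * num_scales, num_scales, kernel_size=1),
            nn.Softmax(dim=1),
        )

    def forward(self, feats):  # list of (B, C, H, W), one per scale
        stacked = torch.stack(feats, dim=1)     # (B, S, C, H, W)
        w = self.gate(torch.cat(feats, dim=1))  # (B, S, 1, 1)
        w = w.view(w.size(0), w.size(1), 1, 1, 1)
        return (w * stacked).sum(dim=1)         # (B, C, H, W)


# Example usage with random per-frame CNN features (batch of 8,
# 64 channels, 28x28 grid); shapes are purely illustrative.
msc = MultiScaleCorrelation(scales=(1, 2), max_disp=3)
fuse = AdaptiveFusion(channels=49, num_scales=2)  # 49 = (2*3+1)**2
f_t = torch.randn(8, 64, 28, 28)
f_t1 = torch.randn(8, 64, 28, 28)
motion = fuse(msc(f_t, f_t1))  # (8, 49, 28, 28) fused motion features
```

In this reading, the gating network weighs the coarse and fine correlation volumes per sample, letting the network emphasize whichever scale carries the motion evidence; the actual MSCM and fusion design may differ.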
