Learning facial action units with spatiotemporal cues and multi-label sampling

Wen-Sheng Chu, Fernando De La Torre, Jeffrey F Cohn

doi:10.1016/j.imavis.2018.10.002

Open Access

Journal: Image and Vision Computing
Publication Date: Oct 28, 2018
Citations: 22
License type: Elsevier-specific OA user license

Affiliation: Carnegie Mellon University, University of Pittsburgh

Abstract

Facial action units (AUs) can be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations, and a Long Short-Term Memory (LSTM) network to model temporal dependencies among them. The outputs of the CNNs and LSTMs are aggregated in a fusion network to produce per-frame predictions of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and obtained increased accuracy for AU detection. To address class imbalance within and between batches during network training, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualization of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
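To make the class-imbalance problem concrete, the sketch below shows one generic way to draw a training batch in which sparse AUs are not crowded out by frequent ones. It is an illustrative assumption, not the sampling strategy proposed in the paper: the function name `balanced_multilabel_batch`, the greedy rule, and the toy labels are all hypothetical.

```python
import random
from collections import Counter


def balanced_multilabel_batch(labels, batch_size, seed=0):
    """Greedily pick distinct frame indices so every AU is represented evenly.

    labels: list of sets; labels[i] holds the AUs active in frame i.
    Hypothetical helper for illustration only, not the paper's algorithm.
    """
    rng = random.Random(seed)
    classes = sorted(set().union(*labels))
    # Index the frames by the AUs they contain.
    frames_by_class = {
        c: [i for i, active in enumerate(labels) if c in active] for c in classes
    }
    counts = Counter({c: 0 for c in classes})  # per-AU count inside the batch
    remaining = list(classes)                  # AUs that still have unused frames
    chosen, used = [], set()

    while len(chosen) < batch_size:
        if remaining:
            # Target the AU that is least represented in the batch so far.
            target = min(remaining, key=lambda c: (counts[c], c))
            candidates = [i for i in frames_by_class[target] if i not in used]
            if not candidates:
                remaining.remove(target)  # this AU has no unused frames left
                continue
        else:
            # Every AU is exhausted; fill up with any unused frame.
            candidates = [i for i in range(len(labels)) if i not in used]
            if not candidates:
                break  # fewer frames than batch_size
        pick = rng.choice(candidates)
        chosen.append(pick)
        used.add(pick)
        counts.update(labels[pick])  # a frame may add to several AUs at once
    return chosen


# Toy data: AU12 is frequent (90 frames), AU4 is sparse (10 frames).
toy_labels = [{"AU12"}] * 90 + [{"AU4"}] * 10
batch = balanced_multilabel_batch(toy_labels, batch_size=10)
```

On this toy set, a batch of 10 contains five AU12 frames and five AU4 frames, whereas uniform sampling would yield about one AU4 frame on average. Because a frame can carry several AUs, picking it raises the counts of all its labels, which is the complication that makes multi-label balancing harder than single-label stratification.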

Full Text