Abstract

The limited size of annotated video databases of spontaneous facial expression, imbalanced action unit (AU) labels, and domain shift are three main obstacles in training models to detect facial actions and estimate their intensity. To address these problems, we propose an approach that incorporates facial expression generation for facial action unit intensity estimation. Our approach reconstructs the 3D shape of the face from each video frame, aligns the 3D mesh to a canonical view, and trains a GAN-based network to synthesize novel images with facial action units of interest. We leverage the synthetic images to achieve two goals: 1) generating AU-balanced databases, and 2) tackling domain shift with personalized networks. To generate a balanced database, we synthesize expressions with varying AU intensities and perform semantic resampling. Our experimental results on FERA17 show that networks trained on synthesized facial expressions outperform those trained on actual facial expressions and surpass current state-of-the-art approaches. To tackle domain shift, we propose personalizing pretrained networks. We generate synthetic expressions of each target subject with varying AU intensity labels and use the person-specific synthetic images to fine-tune pretrained networks. To evaluate the performance of the personalized networks, we use the DISFA and PAIN databases. Personalized networks, which require only a single image from each target subject to generate synthetic images, achieved significant improvement in generalizing to unseen domains.

