Abstract

Speech conveys not only verbal content but also emotions, which manifest as the facial expressions of the speaker. In this article, we present deep learning frameworks that directly infer facial expressions from speech signals alone. Specifically, the time-varying, context-dependent non-linear mapping between the audio stream and micro facial movements is realized by our proposed recurrent neural networks, which drive a 3D blendshape face model in real time. Our models not only activate appropriate facial action units (AUs), defined as 3D expression blendshapes in the FaceWarehouse database, to depict utterance-generating actions in the form of lip movements, but also, without any prior assumptions, automatically estimate the speaker's emotional intensity and reproduce her ever-changing affective states by adjusting the strength of the related facial unit activations. In the baseline models, conventional handcrafted acoustic features are used to predict facial actions. Furthermore, we show that it is more advantageous to learn meaningful acoustic feature representations from speech spectrograms with convolutional networks, which further improves the accuracy of facial action synthesis. Experiments on diverse audiovisual corpora of different actors, covering a wide range of facial actions and emotional states, show promising results for our approaches. Being speaker-independent, our generalized models are readily applicable to various tasks in human-machine interaction and animation.
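
As a rough illustration of the spectrogram-based pipeline described above, the sketch below pairs a small convolutional front-end (learning acoustic features from mel spectrogram frames) with a recurrent layer that regresses per-frame blendshape activation weights. This is a minimal PyTorch sketch under stated assumptions, not the authors' exact architecture: the layer sizes, the number of mel bins, and the blendshape count (46, a common FaceWarehouse-style rig size) are all illustrative choices.

```python
# Minimal sketch: CNN feature learning on spectrograms + RNN temporal modeling
# + linear head regressing blendshape (facial action unit) weights per frame.
# All hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn


class SpeechToBlendshapes(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, n_blendshapes: int = 46):
        super().__init__()
        # Convolutional front-end over the (frequency, time) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                      # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        conv_dim = 64 * (n_mels // 4)
        # Recurrent layer models the time-varying context of the audio stream.
        self.rnn = nn.LSTM(conv_dim, hidden, batch_first=True)
        # Linear head outputs one activation weight per blendshape per frame.
        self.head = nn.Linear(hidden, n_blendshapes)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, n_mels, n_frames)
        x = self.conv(spectrogram.unsqueeze(1))        # (batch, 64, n_mels//4, n_frames)
        x = x.flatten(1, 2).transpose(1, 2)            # (batch, n_frames, conv_dim)
        x, _ = self.rnn(x)
        return torch.sigmoid(self.head(x))             # blendshape weights in [0, 1]


if __name__ == "__main__":
    model = SpeechToBlendshapes()
    mel = torch.randn(2, 80, 100)                      # 2 clips, 80 mel bins, 100 frames
    weights = model(mel)
    print(weights.shape)                               # torch.Size([2, 100, 46])
```

The sigmoid output range is one plausible convention for blendshape activations; the baseline variant in the abstract would simply replace the convolutional front-end with handcrafted acoustic features fed to the recurrent layer.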
