Abstract

Speech conveys not only verbal content but also emotions, which manifest as the facial expressions of the speaker. In this article, we present deep learning frameworks that directly infer facial expressions from speech signals alone. Specifically, the time-varying, context-dependent non-linear mapping between the audio stream and micro facial movements is realized by our proposed recurrent neural networks, which drive a 3D blendshape face model in real time. Our models not only activate appropriate facial action units (AUs), defined as 3D expression blendshapes in the FaceWarehouse database, to depict utterance-generating actions in the form of lip movements, but also, without any prior assumptions, automatically estimate the speaker's emotional intensity and reproduce her ever-changing affective states by adjusting the strength of the related facial unit activations. In the baseline models, conventional handcrafted acoustic features are used to predict facial actions. Furthermore, we show that it is more advantageous to learn meaningful acoustic feature representations from speech spectrograms with convolutional networks, which further improves the accuracy of facial action synthesis. Experiments on diverse audiovisual corpora of different actors, covering a wide range of facial actions and emotional states, show promising results for our approaches. Being speaker-independent, our generalized models are readily applicable to various tasks in human-machine interaction and animation.
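
As a rough illustration of the spectrogram-based pipeline described above, the sketch below pairs a small convolutional front-end (learning acoustic features from mel spectrogram frames) with a recurrent layer that regresses per-frame blendshape activation weights. This is a minimal PyTorch sketch under stated assumptions, not the authors' exact architecture: the layer sizes, the number of mel bins, and the blendshape count (46, a common FaceWarehouse-style rig size) are all illustrative choices.

```python
# Minimal sketch: CNN feature learning on spectrograms + RNN temporal modeling
# + linear head regressing blendshape (facial action unit) weights per frame.
# All hyperparameters below are illustrative assumptions.

import torch
import torch.nn as nn


class SpeechToBlendshapes(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256, n_blendshapes: int = 46):
        super().__init__()
        # Convolutional front-end over the (frequency, time) spectrogram.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                      # pool along frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        conv_dim = 64 * (n_mels // 4)
        # Recurrent layer models the time-varying context of the audio stream.
        self.rnn = nn.LSTM(conv_dim, hidden, batch_first=True)
        # Linear head outputs one activation weight per blendshape per frame.
        self.head = nn.Linear(hidden, n_blendshapes)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, n_mels, n_frames)
        x = self.conv(spectrogram.unsqueeze(1))        # (batch, 64, n_mels//4, n_frames)
        x = x.flatten(1, 2).transpose(1, 2)            # (batch, n_frames, conv_dim)
        x, _ = self.rnn(x)
        return torch.sigmoid(self.head(x))             # blendshape weights in [0, 1]


if __name__ == "__main__":
    model = SpeechToBlendshapes()
    mel = torch.randn(2, 80, 100)                      # 2 clips, 80 mel bins, 100 frames
    weights = model(mel)
    print(weights.shape)                               # torch.Size([2, 100, 46])
```

The sigmoid output range is one plausible convention for blendshape activations; the baseline variant in the abstract would simply replace the convolutional front-end with handcrafted acoustic features fed to the recurrent layer.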
