Abstract

Realistic animation of faces has proven challenging because viewers are highly sensitive to errors in visual articulation. We study the problem of learning to automatically generate natural speech-related facial motion from audio speech, which can then be used to drive both CG and robotic talking heads with low bandwidth requirements and low latency. A many-to-one mapping from acoustic phones to lip shapes (i.e., static “visemes”) is a poor approximation of the complex, context-dependent relationship between visual speech and acoustic speech production. We introduced “dynamic visemes” as data-derived, visual-only speech units associated with distributions of phone strings and demonstrated that they capture context and co-articulation. Further improvement in predicting visual speech can be achieved with an end-to-end deep learning approach. We train a sliding-window deep neural network that learns a mapping from a window of phone labels or acoustic features to a window of visual features. This approach removes the...
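As a rough illustration of the sliding-window mapping described above, the sketch below shows a feed-forward network that maps a flattened window of acoustic feature frames to a flattened window of visual (face-parameter) frames. This is a minimal sketch, not the authors' implementation: the feature dimensions, window lengths, and layer sizes are assumptions chosen only to make the example concrete.

```python
# Minimal sketch of a sliding-window DNN for audio-to-visual speech mapping.
# All dimensions below are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

AUDIO_DIM = 39    # assumed per-frame acoustic features (e.g., MFCCs + deltas)
VISUAL_DIM = 30   # assumed per-frame visual/face parameters
IN_WIN = 11       # assumed input window length in frames
OUT_WIN = 5       # assumed output window length in frames

class SlidingWindowDNN(nn.Module):
    """Maps a flattened window of audio frames to a flattened window of visual frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_WIN * AUDIO_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, OUT_WIN * VISUAL_DIM),
        )

    def forward(self, x):          # x: (batch, IN_WIN * AUDIO_DIM)
        return self.net(x)         # -> (batch, OUT_WIN * VISUAL_DIM)

def make_windows(frames, win):
    """Slice a (T, D) frame sequence into overlapping flattened windows of length `win`."""
    return torch.stack([frames[t:t + win].reshape(-1)
                        for t in range(frames.shape[0] - win + 1)])

# Example: predict visual windows for a 100-frame utterance. At synthesis time
# the overlapping output windows would be blended (e.g., averaged) per frame.
audio = torch.randn(100, AUDIO_DIM)
model = SlidingWindowDNN()
pred = model(make_windows(audio, IN_WIN))   # (100 - IN_WIN + 1, OUT_WIN * VISUAL_DIM)
```

Because each prediction covers a window of output frames, consecutive predictions overlap; averaging the overlapping frames is one simple way to obtain a smooth visual trajectory, consistent with the windowed formulation described in the abstract.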
