Abstract

Realistic animation of faces has proven challenging because viewers are highly sensitive to errors in visual articulation. We study the problem of learning to automatically generate natural speech-related facial motion from audio speech, which can then be used to drive both CG and robotic talking heads with low bandwidth requirements and low latency. A many-to-one mapping from acoustic phones to lip shapes (i.e., static “visemes”) is a poor approximation of the complex, context-dependent relationship between visual speech and acoustic speech production. We introduced “dynamic visemes” as data-derived, visual-only speech units associated with distributions of phone strings and demonstrated that they capture context and co-articulation. Further improvement in predicting visual speech can be achieved with an end-to-end deep learning approach. We train a sliding-window deep neural network that learns a mapping from a window of phone labels or acoustic features to a window of visual features. This approach removes the...
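As a rough illustration of the sliding-window mapping described above, the sketch below shows a feed-forward network that maps a flattened window of acoustic feature frames to a flattened window of visual (face-parameter) frames. This is a minimal sketch, not the authors' implementation: the feature dimensions, window lengths, and layer sizes are assumptions chosen only to make the example concrete.

```python
# Minimal sketch of a sliding-window DNN for audio-to-visual speech mapping.
# All dimensions below are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

AUDIO_DIM = 39    # assumed per-frame acoustic features (e.g., MFCCs + deltas)
VISUAL_DIM = 30   # assumed per-frame visual/face parameters
IN_WIN = 11       # assumed input window length in frames
OUT_WIN = 5       # assumed output window length in frames

class SlidingWindowDNN(nn.Module):
    """Maps a flattened window of audio frames to a flattened window of visual frames."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IN_WIN * AUDIO_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, OUT_WIN * VISUAL_DIM),
        )

    def forward(self, x):          # x: (batch, IN_WIN * AUDIO_DIM)
        return self.net(x)         # -> (batch, OUT_WIN * VISUAL_DIM)

def make_windows(frames, win):
    """Slice a (T, D) frame sequence into overlapping flattened windows of length `win`."""
    return torch.stack([frames[t:t + win].reshape(-1)
                        for t in range(frames.shape[0] - win + 1)])

# Example: predict visual windows for a 100-frame utterance. At synthesis time
# the overlapping output windows would be blended (e.g., averaged) per frame.
audio = torch.randn(100, AUDIO_DIM)
model = SlidingWindowDNN()
pred = model(make_windows(audio, IN_WIN))   # (100 - IN_WIN + 1, OUT_WIN * VISUAL_DIM)
```

Because each prediction covers a window of output frames, consecutive predictions overlap; averaging the overlapping frames is one simple way to obtain a smooth visual trajectory, consistent with the windowed formulation described in the abstract.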
