Abstract

In this paper, we present a novel approach to speech-driven facial animation using a non-parametric switching state space model based on Gaussian processes. The model is an extension of the shared Gaussian process dynamical model, augmented with switching states. Two talking head corpora are processed by extracting visual and audio data from the sequences followed by a parameterization of both data streams. Phonetic labels are obtained by performing forced phonetic alignment on the audio. The switching states are found using a variable length Markov model trained on the labelled phonetic data. The audio and visual data corresponding to phonemes matching each switching state are extracted and modelled together using a shared Gaussian process dynamical model. We propose a synthesis method that takes into account both previous and future phonetic context, thus accounting for forward and backward coarticulation in speech. Both objective and subjective evaluation results are presented. The quantitative results demonstrate that the proposed method outperforms other state-of-the-art methods in visual speech synthesis and the qualitative results reveal that the synthetic videos are comparable to ground truth in terms of visual perception and intelligibility.
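
For orientation, the generative structure implied by the abstract can be sketched as follows. This is an illustrative formulation only; the symbols (s_t, x_t, y_t, and the kernels) are assumed here and are not taken from the paper itself.

% Sketch of a switching shared GPDM, assuming:
%   s_t        : discrete switching state (phonetic class) given by the variable length Markov model
%   x_t        : shared latent trajectory, with one shared GPDM per switching state
%   y_t^a, y_t^v : parameterized audio and visual observations
\begin{align}
  s_t     &\sim p\!\left(s_t \mid s_{t-1}, \dots, s_{t-k}\right)
          && \text{(VLMM over phonetic context)} \\
  x_t     &= f_{s_t}(x_{t-1}) + \epsilon_x,
          \quad f_{s_t} \sim \mathcal{GP}\!\left(0, k_x\right)
          && \text{(latent dynamics)} \\
  y_t^{a} &= g^{a}_{s_t}(x_t) + \epsilon_a,
          \quad g^{a}_{s_t} \sim \mathcal{GP}\!\left(0, k_a\right)
          && \text{(audio mapping)} \\
  y_t^{v} &= g^{v}_{s_t}(x_t) + \epsilon_v,
          \quad g^{v}_{s_t} \sim \mathcal{GP}\!\left(0, k_v\right)
          && \text{(visual mapping)}
\end{align}

Under this reading, synthesis amounts to selecting a switching-state sequence from the phonetic context (both preceding and following, to capture coarticulation), inferring a latent trajectory consistent with the audio mapping, and projecting it through the visual mapping.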
