Speech-to-lip movement synthesis maximizing audio-visual joint probability based on EM algorithm

Satoshi Nakamura ,Eli Yamamoto ,Kiyohiro Shikano

doi:10.1023/a:1008179732362

Abstract

We investigate methods using the hidden Markov model (HMM) to drive a lip movement sequence with input speech. We have already investigated a mapping method based on the Viterbi decoding algorithm which converts an input speech to a lip movement sequence through the most likely HMM state sequence conducted by audio HMMs. However, the method contains a substantial problem of producing errors along incorrectly decoded HMM states. This paper newly proposes a method to re-estimate the visual parameters using the HMMs of the audio-visual joint probability under the expectation-maximization (EM) algorithm. In experiments, the proposed mapping method using the EM algorithm shows an error reduction of 26% compared to a method using the Viterbi algorithm at incorrectly decoded bi-labial consonants.

Full Text