Abstract
In this paper, we propose an HMM trajectory-guided, real-image-sample concatenation approach to photo-realistic talking head synthesis. An audio-visual database of a speaker is first recorded and used to train a statistical hidden Markov model (HMM) of lip movement. The HMM is then used to generate, in the maximum-probability sense, the dynamic trajectory of lip movement for a given speech signal. The generated trajectory serves as a guide for selecting, from the original training database, an optimal sequence of lip images, which are then stitched back onto a background head video. We also propose a minimum generation error (MGE) training method that refines the audio-visual HMM to improve visual speech trajectory synthesis. Compared with traditional maximum likelihood (ML) estimation, the proposed MGE training explicitly optimizes the quality of the generated visual speech trajectory: the audio-visual HMM is jointly refined by a heuristic method that finds the optimal state alignment and a probabilistic descent algorithm that optimizes the model parameters under the MGE criterion. In objective evaluations, the proposed MGE-based method achieves consistent improvements over the ML-based method: reduced mean square error, increased correlation, and better recovery of global variance. From as little as 20 minutes of recorded audio/video footage, the proposed system can synthesize a highly photo-realistic talking head in sync with given speech signals, whether natural or TTS-synthesized. The system won first place in the A/V consistency contest of the LIPS Challenge, as perceptually evaluated by recruited human subjects.
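As a brief sketch of the two core steps (the notation below is ours, not the paper's, and assumes the standard static-plus-dynamic feature formulation of HMM trajectory generation), the maximum-probability trajectory for a state sequence $q$ with stacked state means $\mu_q$, covariances $\Sigma_q$, and window matrix $W$ appending dynamic features is

\[
\bar{c} = \arg\max_{c}\, \mathcal{N}\!\left(Wc \,\middle|\, \mu_q, \Sigma_q\right) = \left(W^{\top}\Sigma_q^{-1}W\right)^{-1} W^{\top}\Sigma_q^{-1}\mu_q,
\]

and MGE training then adjusts the model parameters $\lambda$ to minimize the error of this generated trajectory against the natural one, summed over the training utterances,

\[
\lambda^{*} = \arg\min_{\lambda} \sum_{n} \bigl\lVert \bar{c}^{(n)}(\lambda) - c^{(n)} \bigr\rVert^{2}.
\]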