Abstract

While speech animation fundamentally consists of a sequence of phonemes over time, sophisticated animation requires smooth interpolation and co-articulation effects, where the preceding and following phonemes influence the mouth shape of a phoneme. Co-articulation has been approached in speech animation research in several ways, most often by simply smoothing the mouth geometry motion over time. Data-driven approaches tend to generate realistic speech animation, but they need to store a large facial motion database, which is not feasible for real-time gaming and interactive applications on platforms such as PDAs and cell phones. In this paper we show that an accurate and compact speech co-articulation model can be learned from facial motion capture data. An initial phoneme sequence is generated automatically by a text-to-speech (TTS) system. Our co-articulation model is then applied to the resulting phoneme sequence, producing natural and detailed motion. The contribution of this work is that speech co-articulation models learned from real human motion data can generate natural-looking speech motion while preserving the expressiveness of the animation via keyframing control. At the same time, the model's compact size makes the approach well suited to interactive applications.
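To illustrate the general idea behind co-articulation, the sketch below shows one simple way neighboring phonemes can influence the mouth shape assigned to a phoneme. The viseme targets, the `coarticulate` function, and the fixed blending weights here are hypothetical stand-ins chosen for illustration; the paper's model learns these influences from facial motion capture data rather than fixing them by hand.

```python
import numpy as np

# Hypothetical viseme targets: each phoneme maps to a mouth-shape
# parameter vector (e.g., jaw open, lip width, lip protrusion).
VISEME_TARGETS = {
    "SIL": np.array([0.0, 0.2, 0.0]),
    "M":   np.array([0.0, 0.3, 0.2]),
    "AA":  np.array([0.9, 0.4, 0.1]),
    "UW":  np.array([0.3, 0.1, 0.9]),
}

def coarticulate(phonemes, w_prev=0.25, w_next=0.25):
    """Blend each phoneme's viseme target with its neighbors.

    w_prev and w_next are illustrative co-articulation weights;
    in a learned model these influences would come from data.
    """
    shapes = [VISEME_TARGETS[p] for p in phonemes]
    w_self = 1.0 - w_prev - w_next
    blended = []
    for i, s in enumerate(shapes):
        prev_s = shapes[i - 1] if i > 0 else s
        next_s = shapes[i + 1] if i < len(shapes) - 1 else s
        blended.append(w_self * s + w_prev * prev_s + w_next * next_s)
    return blended

# Example: an initial phoneme sequence, as a TTS front end might emit.
seq = ["SIL", "M", "AA", "UW", "SIL"]
for p, shape in zip(seq, coarticulate(seq)):
    print(p, np.round(shape, 2))
```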
