Lifelike talking faces for interactive services are an exciting new modality for man-machine interactions. Recent developments in speech synthesis and computer animation enable the real-time synthesis of faces that look and behave like real people, opening opportunities to make interactions with computers more like face-to-face conversations. This paper focuses on the technologies for creating lifelike talking heads, illustrating the two main approaches: model-based animation and sample-based animation. The traditional model-based approach uses three-dimensional wire-frame models, which can be animated from high-level parameters such as muscle actions, lip postures, and facial expressions. The sample-based approach, on the other hand, concatenates segments of recorded video instead of trying to model the dynamics of the animation in detail. Recent advances in image analysis enable the creation of large databases of mouth and eye images suited to sample-based animation. The sample-based approach tends to generate more natural-looking animations, at the expense of a larger database size and less flexibility than model-based animation. Besides lip articulation, a talking head must show appropriate head movements in order to appear natural. We illustrate how such visual prosody is analyzed and added to the animations. Finally, we present four applications where the use of face animation in interactive services results in engaging user interfaces and an increased level of trust between user and machine. Using an RTP-based protocol, face animation can be driven with only 800 bits/s in addition to the rate for transmitting audio.
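To give a feel for how modest the quoted 800 bits/s side channel is, the following is a minimal sketch of packing per-frame animation parameters for streaming alongside the audio. The parameter set, update rate, and bit widths are illustrative assumptions, not the encoding used in the paper.

```python
# Hypothetical sketch: fitting face-animation parameters into an 800 bit/s
# side channel next to the audio stream. Parameter names, the 10 Hz update
# rate, and the 8-bit quantization are assumptions for illustration only.
import struct

FRAME_RATE_HZ = 10                                    # assumed parameter update rate
BIT_BUDGET_PER_SEC = 800                              # rate quoted in the abstract
BITS_PER_FRAME = BIT_BUDGET_PER_SEC // FRAME_RATE_HZ  # 80 bits available per update


def quantize(value, lo, hi, bits):
    """Map a float in [lo, hi] to an unsigned integer with the given bit width."""
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)


def pack_frame(mouth_open, lip_width, head_pitch, head_yaw, eye_blink):
    """Pack five example parameters into 5 x 8 = 40 bits,
    well within the assumed 80-bit-per-frame budget."""
    fields = [
        quantize(mouth_open, 0.0, 1.0, 8),
        quantize(lip_width, 0.0, 1.0, 8),
        quantize(head_pitch, -30.0, 30.0, 8),   # degrees
        quantize(head_yaw, -45.0, 45.0, 8),     # degrees
        quantize(eye_blink, 0.0, 1.0, 8),
    ]
    return struct.pack("5B", *fields)


if __name__ == "__main__":
    payload = pack_frame(0.6, 0.4, -2.0, 5.0, 0.0)
    bits_used = len(payload) * 8
    print(f"{bits_used} bits/frame used, {BITS_PER_FRAME} bits/frame available")
    print(f"resulting rate: {bits_used * FRAME_RATE_HZ} bits/s <= {BIT_BUDGET_PER_SEC} bits/s")
```

Under these assumed numbers, a handful of quantized mouth, head, and eye parameters per update still leaves headroom for packet overhead within the 800 bits/s figure.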