Abstract

This chapter introduces a basic audio-visual (AV) speech synthesis system that generates simple lip motion from input text using a text-to-speech (TTS) system. The basic AV speech synthesis system gives satisfactory results for many applications that do not require fully accurate lip motion, especially when used with nonrealistic virtual characters such as cartoon figures. The TTS system is at the core of any AV speech synthesis system: it generates and outputs speech audio from plain text input in ASCII or Unicode encoding. Crucially for AV speech synthesis, the TTS also provides information about the phonemes it generates and the times at which they occur. Most TTS systems can provide such information, though they vary in the mechanism for passing it to other components. Common mechanisms include a callback function, implemented by the consumer of the information and invoked by the TTS at the appropriate times, or a fast preprocessing step that generates a list of phonemes and other events. It is assumed that the animation system receives the information about each phoneme no later than the time the phoneme is pronounced, and that the TTS and the animation system share a common time base. Under these assumptions, the animation system can simply map each phoneme to a mouth shape and display that mouth shape at the appropriate time.
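The phoneme-to-mouth-shape mapping described above can be sketched in a few lines. The following is a minimal illustration, not the chapter's implementation: the phoneme table, the event list, and the `show` renderer hook are all hypothetical stand-ins, and the event list plays the role of the preprocessing output a TTS system might emit.

```python
import time

# Hypothetical phoneme-to-mouth-shape (viseme) table. Real systems group
# the full phoneme inventory into roughly a dozen visemes.
PHONEME_TO_VISEME = {
    "AA": "open",
    "IY": "spread",
    "UW": "rounded",
    "M": "closed",
    "F": "lower-lip-touch",
    "sil": "rest",
}

def animate(phoneme_events, show=print):
    """Display the mouth shape for each (phoneme, start_seconds) event.

    `phoneme_events` stands in for the event list a TTS preprocessing
    step would produce; `show` stands in for the renderer. A monotonic
    clock started at playback time serves as the common time base.
    """
    t0 = time.monotonic()
    for phoneme, start in phoneme_events:
        # Wait until this phoneme's start time on the shared time base.
        delay = start - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)
        # Unknown phonemes fall back to the rest shape.
        show(PHONEME_TO_VISEME.get(phoneme, "rest"))

# Example event list for a short utterance ("mama"-like), with start
# times in seconds, as a TTS preprocessing pass might produce it.
events = [("sil", 0.0), ("M", 0.05), ("AA", 0.15), ("M", 0.30), ("AA", 0.40)]
```

In a callback-based TTS integration, the body of the loop would instead run inside the callback the TTS invokes per phoneme, with no explicit waiting needed.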
