Abstract

This paper presents a new method for three-dimensional (3D) facial model adaptation and its integration into a text-to-speech (TTS) system. The 3D facial adaptation requires a set of two orthogonal views of the user's face with a number of feature points located on both views. Based on the correspondences of the feature points' positions, a generic face model is deformed nonrigidly, treating every facial part as a separate entity. A cylindrical texture map is then built from the two image views. The generated head models are compared to corresponding models obtained by the commonly used adaptation method that utilizes 3D radial basis functions. The generated 3D models are integrated into a talking head system, which consists of two distinct parts: a multilingual text-to-speech subsystem and an MPEG-4 compliant facial animation subsystem. Support for the Greek language has been added while preserving lip and speech synchronization.
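
As a rough illustration of the cylindrical texture-mapping step, the minimal numpy sketch below assigns a cylindrical (u, v) texture coordinate to each vertex of the adapted head model; the texture sampled at these coordinates would be blended from the two orthogonal views. The upright y-axis orientation, the centroid cylinder axis, and the [0, 1] normalization are assumptions for illustration, not details taken from the paper:

    import numpy as np

    def cylindrical_uv(vertices):
        """Map (V, 3) head-mesh vertices to cylindrical (u, v) in [0, 1]^2.

        Sketch only: assumes the head is roughly upright along the y-axis,
        with the cylinder axis through the mesh centroid in the xz-plane.
        """
        v = np.asarray(vertices, dtype=float)
        cx, cz = v[:, 0].mean(), v[:, 2].mean()
        theta = np.arctan2(v[:, 0] - cx, v[:, 2] - cz)   # angle around the vertical axis
        u = (theta + np.pi) / (2.0 * np.pi)              # angle -> [0, 1]
        y = v[:, 1]
        w = (y - y.min()) / (y.max() - y.min())          # height -> [0, 1]
        return np.stack([u, w], axis=1)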

Highlights

  • Talking heads are used in a wide variety of applications, such as human-machine interfaces (HMI) and virtual reality

  • The generated 3D models are integrated into a talking head system, which consists of two distinct parts: a multilingual text-to-speech system and a facial animation engine based on MPEG-4 facial animation parameters (FAPs)

  • Synchronization between the sound file produced by multiband resynthesis pitch-synchronous overlap-add (MBROLA) and the FAP file produced by converting the phonemes to FAPs is achieved because the phoneme durations are available to both the TTS and the facial animation engine (see the sketch below)
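
As an illustration of this shared-duration synchronization, the minimal Python sketch below turns an MBROLA-style list of (phoneme, duration in ms) pairs into per-frame viseme keyframes for the FAP stream. The PHONEME_TO_VISEME table, the 25 fps frame rate, and the function names are hypothetical, not the paper's actual data or code:

    # Hypothetical phoneme-to-viseme mapping; the paper's actual table is not given here.
    PHONEME_TO_VISEME = {"a": 1, "o": 2, "m": 3, "_": 0}

    def phonemes_to_fap_frames(phonemes, fps=25):
        """phonemes: list of (phoneme, duration_ms) pairs, as fed to MBROLA."""
        frames, t_ms = [], 0.0
        frame_len = 1000.0 / fps
        for ph, dur in phonemes:
            viseme = PHONEME_TO_VISEME.get(ph, 0)   # default to a neutral mouth
            end = t_ms + dur
            while t_ms < end:                       # one keyframe per video frame
                frames.append((round(t_ms), viseme))
                t_ms += frame_len
        return frames

    # Because the same durations drive both MBROLA's audio and this timeline,
    # the mouth shapes stay aligned with the synthesized speech.
    print(phonemes_to_fap_frames([("m", 80), ("a", 120), ("_", 100)]))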

Summary

INTRODUCTION

Talking heads are used in a wide variety of applications, such as human-machine interfaces (HMI) and virtual reality. In [8], the required 3D positions of the facial features are estimated from a series of captured image frames of the target face, and the generic model is transformed by applying an interpolation function based on radial basis functions. Our approach differs from all of the above methods in that it treats the facial model as a collection of facial parts, each of which is allowed to deform according to a separate affine transformation. This simple method requires no specialized equipment (such as a 3D laser scanner) and is effective for face characterization, because the physiological differences between faces arise from precisely such local variations: a person may have a longer or shorter nose, narrower eyes, and so forth. The generated 3D models are integrated into a talking head system which consists of two distinct parts: a multilingual text-to-speech system and a facial animation engine based on MPEG-4 facial animation parameters (FAPs). The feature points are moved to their required 3D positions, while the rest of the model nodes are displaced so that the natural characteristics of the human face are preserved.
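
To make the per-part idea concrete, the numpy sketch below fits a separate 3D affine transform to each facial part from its feature-point correspondences and applies it to that part's vertices. It is only an illustration of the general technique: the least-squares fit, the part labeling, and all names are assumptions, and the paper's additional step of displacing the remaining nodes to preserve the face's natural characteristics is omitted here.

    import numpy as np

    def fit_affine(src, dst):
        """Least-squares 3D affine map taking src feature points to dst.

        src, dst: (N, 3) arrays of corresponding positions (N >= 4,
        not coplanar). Returns a (3, 4) matrix M = [A | t] such that
        dst ~= [src 1] @ M.T.
        """
        src_h = np.hstack([src, np.ones((len(src), 1))])  # homogeneous coordinates
        M, *_ = np.linalg.lstsq(src_h, dst, rcond=None)   # (4, 3) solution
        return M.T                                        # (3, 4)

    def adapt_model(vertices, part_of_vertex, feats_by_part):
        """Deform each facial part (eyes, nose, mouth, ...) with its own affine map.

        vertices:       (V, 3) generic-model vertices
        part_of_vertex: (V,) integer part label per vertex
        feats_by_part:  {part: (generic_feats, target_feats)} correspondences
        """
        out = vertices.copy()
        for part, (src, dst) in feats_by_part.items():
            M = fit_affine(np.asarray(src, float), np.asarray(dst, float))
            idx = part_of_vertex == part
            vh = np.hstack([vertices[idx], np.ones((idx.sum(), 1))])
            out[idx] = vh @ M.T                           # apply [A | t] to this part
        return out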

Calculation of the 3D positions of the features
Rigid adaptation
Nonrigid adaptation
THE TALKING HEAD SYSTEM
Text-to-speech module
Facial animation module
EXPERIMENTAL RESULTS
CONCLUSIONS