Abstract

We describe how our Finnish talking head was improved with a new auditory speech synthesis method based on neural networks and with optimal synchronization of the facial speech animation and the audio signal. In the first version of the talking head, the user typed in text, and synthesized auditory speech and synchronized facial animation were generated automatically. We combine a 3D facial model with a commercial auditory text-to-speech synthesizer (TTS); the auditory speech is produced by concatenating pre-recorded samples of natural speech according to a set of rules. The quality of the current speech synthesis is not yet adequate. We have therefore developed a new strategy to improve the TTS and its synchronization with the facial animation, especially when hardware capabilities are limited. We are developing a method that achieves optimal synchronization independently of the platform used; it is based on predictive visual synthesis. The new synchronization method gives us better control over audio-visual speech synthesis in the time domain: given the diphone durations, we can apply a more realistic interpolation function between the visemes, and thus also take coarticulation effects into account.
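The abstract only outlines the duration-driven interpolation idea. As a minimal illustration of how diphone durations could drive a smoothed transition between viseme targets, consider the sketch below. All names and values in it (the VISEMES table, the facial parameters, the frame rate, interpolate_visemes) are hypothetical assumptions for illustration, not taken from the paper; the blend toward the upcoming viseme stands in for a simple coarticulation effect.

```python
# Hypothetical viseme targets: each viseme is a small set of facial
# animation parameters (e.g., jaw opening, lip rounding) in [0, 1].
# The names and values are illustrative only.
VISEMES = {
    "sil": {"jaw": 0.0, "round": 0.0},
    "aa":  {"jaw": 0.8, "round": 0.1},
    "uu":  {"jaw": 0.3, "round": 0.9},
}

def smoothstep(t: float) -> float:
    """S-shaped easing so the mouth accelerates and decelerates
    instead of moving linearly between targets."""
    return t * t * (3.0 - 2.0 * t)

def interpolate_visemes(segments, frame_rate=25.0):
    """Generate per-frame facial parameters from a timed viseme sequence.

    segments: list of (viseme_name, duration_seconds) pairs, with the
    durations taken from the TTS diphone timing. Within each segment,
    the current viseme target is blended toward the next one, which
    crudely approximates coarticulation: the mouth starts moving toward
    the upcoming sound before the current one ends.
    """
    frames = []
    for i, (name, duration) in enumerate(segments):
        nxt = segments[i + 1][0] if i + 1 < len(segments) else name
        n_frames = max(1, round(duration * frame_rate))
        for f in range(n_frames):
            t = smoothstep(f / n_frames)
            frame = {
                key: (1.0 - t) * VISEMES[name][key] + t * VISEMES[nxt][key]
                for key in VISEMES[name]
            }
            frames.append(frame)
    return frames

if __name__ == "__main__":
    # The durations (in seconds) would come from the TTS diphone timing.
    timeline = [("sil", 0.10), ("aa", 0.20), ("uu", 0.25), ("sil", 0.10)]
    for i, frame in enumerate(interpolate_visemes(timeline)):
        print(f"frame {i:02d}: jaw={frame['jaw']:.2f} round={frame['round']:.2f}")
```

A fuller model would weight each viseme's influence by phoneme dominance rather than blending pairwise, but even this per-frame scheme shows why knowing diphone durations in advance, as in the predictive visual synthesis described above, allows a more realistic interpolation than fixed-rate keyframing.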
