Abstract

We have constructed an audio-visual text-to-speech synthesizer for Finnish by combining a dynamic facial model with an acoustic speech synthesizer. The visual speech is based on a letter-to-viseme mapping, and the animation is created by linear interpolation between visemes, each defined by 12 parameter values. In a recent study we showed that visual speech increases the intelligibility of both natural and synthetic auditory speech [5]. We have since upgraded our visual speech synthesis by adding a tongue model and improving the visual speech parameters on the basis of that intelligibility study. Here we present data from a new intelligibility study demonstrating the improved performance of the synthesizer. Presenting the visual speech in three-dimensional space did not further improve intelligibility.
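The animation scheme described above (key-frame visemes interpolated linearly over time) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter values and the `lerp_visemes` helper are assumptions for the example, and the real facial model drives 12 named articulatory parameters rather than anonymous numbers.

```python
def lerp_visemes(v_a, v_b, t):
    """Linearly interpolate between two viseme parameter vectors at t in [0, 1]."""
    if len(v_a) != len(v_b):
        raise ValueError("viseme parameter vectors must have equal length")
    return [(1.0 - t) * a + t * b for a, b in zip(v_a, v_b)]

# Two illustrative visemes, each defined by 12 parameter values
# (values are placeholders, e.g. lip opening, jaw rotation, ...).
viseme_closed = [0.0] * 12
viseme_open = [1.0] * 12

# A frame halfway between the two key frames:
frame = lerp_visemes(viseme_closed, viseme_open, 0.5)
```

In a synthesizer of this kind, `t` would advance with the audio timeline so that each animation frame blends the current and next viseme targets.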
