Abstract

We have constructed an audio-visual text-to-speech synthesizer for Finnish by combining a dynamic facial model with an acoustic speech synthesizer. The visual speech is based on a letter-to-viseme mapping, and the animation is created by linear interpolation between visemes, each defined by 12 parameter values. In a recent study we showed that visual speech increases the intelligibility of both natural and synthetic auditory speech [5]. We have since upgraded our visual speech synthesis by adding a tongue model and improving the visual speech parameters on the basis of that intelligibility study. Here we present data from a new intelligibility study demonstrating the improved performance of the synthesizer. Presenting the visual speech in three-dimensional space did not further improve intelligibility.
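The animation scheme described above (key-frame visemes interpolated linearly over time) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the parameter values and the `lerp_visemes` helper are assumptions for the example, and the real facial model drives 12 named articulatory parameters rather than anonymous numbers.

```python
def lerp_visemes(v_a, v_b, t):
    """Linearly interpolate between two viseme parameter vectors at t in [0, 1]."""
    if len(v_a) != len(v_b):
        raise ValueError("viseme parameter vectors must have equal length")
    return [(1.0 - t) * a + t * b for a, b in zip(v_a, v_b)]

# Two illustrative visemes, each defined by 12 parameter values
# (values are placeholders, e.g. lip opening, jaw rotation, ...).
viseme_closed = [0.0] * 12
viseme_open = [1.0] * 12

# A frame halfway between the two key frames:
frame = lerp_visemes(viseme_closed, viseme_open, 0.5)
```

In a synthesizer of this kind, `t` would advance with the audio timeline so that each animation frame blends the current and next viseme targets.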
