Abstract

Since 1990, a series of visual speech synthesizers has been developed and synchronized with a French text-to-speech synthesizer at the ICP in Grenoble. In this article, we describe the different structures of these visual synthesizers. The techniques used include key-frame approaches based on 24 lip/chin images carefully selected to cover most of the basic coarticulated shapes in French, 2D parametric models of the lip contours, and finally 3D parametric models of the main components of the face. The successive versions were systematically evaluated on the same reference corpus, following a standard procedure. Auditory intelligibility and audio-visual intelligibility were compared under several conditions of acoustic distortion to evaluate the benefit of speechreading. Tests were run with acoustic material produced either by a text-to-speech synthesizer or by a reference human speaker. Our results show that while visual speech is unnecessary under clear acoustic conditions, it adds intelligibility to the auditory information when the acoustics are degraded. Furthermore, the intelligibility provided by the visual channel increased steadily through successive improvements of our text-to-visual speech synthesizers.
