Abstract

The author presents a new face module for MASSY, the Modular Audiovisual Speech SYnthesizer [1]. Within this face module the system combines two approaches to visual speech synthesis: although the articulation space is parameterized in terms of articulator movements, the visual synthesis itself is image-based (video-realistic). The high-level visual speech synthesis generates a sequence of control commands for the visible articulation. The video synthesis searches an image database for appropriate video frames. If no image with facial properties matching the control commands is found, the missing image is generated by deforming a neutral image. MPEG-4 facial definition parameters (FDPs) [2] and additional points in the mouth-opening area and around the lower jaw are defined as feature points in the neutral image. A two-dimensional displacement vector is defined for each feature point. For the image deformation, a mesh of triangles connecting the feature points is used; the displacement vector of a point inside a triangle is interpolated from the displacement vectors of the triangle's vertices. Hence, the video synthesis algorithm can use either a database of appropriately annotated video frames or a single neutral image with specified feature points and displacement vectors. A simple software tool for marking the feature points in the image was developed. Other well-known data-based (image-based) audio-visual speech synthesis systems such as MIKETALK [3] and VIDEO REWRITE [4] concatenate prerecorded video sequences. The presented system demonstrates the compatibility of parametric and data-based approaches to visual speech synthesis.
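The per-triangle interpolation described above amounts to standard barycentric interpolation of the vertex displacement vectors. The following is a minimal illustrative sketch, not the authors' implementation; the function names and the NumPy-based formulation are assumptions for clarity.

```python
import numpy as np

def barycentric_coords(p, a, b, c):
    """Barycentric coordinates of 2-D point p with respect to triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    w1 = (d11 * d20 - d01 * d21) / denom
    w2 = (d00 * d21 - d01 * d20) / denom
    w0 = 1.0 - w1 - w2
    return w0, w1, w2

def interpolated_displacement(p, tri_points, tri_displacements):
    """Displacement at p, interpolated from the displacement vectors of the triangle's vertices."""
    a, b, c = tri_points
    w0, w1, w2 = barycentric_coords(p, a, b, c)
    da, db, dc = tri_displacements
    return w0 * da + w1 * db + w2 * dc

# Example (hypothetical values): a point halfway along one edge receives the
# mean of that edge's vertex displacements.
tri = [np.array([0.0, 0.0]), np.array([10.0, 0.0]), np.array([0.0, 10.0])]
disp = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([0.0, 0.0])]
print(interpolated_displacement(np.array([5.0, 0.0]), tri, disp))  # -> [0.5 1. ]
```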
