Abstract

This work outlines a quantitative analysis of the relation between speech acoustics and the face and head motions that accompany speech [A. V. Barbosa, Ph.D. thesis, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil, 2004]. Two-dimensional (2-D) motion data are obtained with a video camera, and an algorithm has been developed for tracking markers on the speaker’s face across the acquired video sequence [A. V. Barbosa, E. Vatikiotis-Bateson, and A. Daffertshofer, in Proceedings of the 8th ICSLP Interspeech 2004, Korea, 2004]. The motion domain is represented by the 2-D marker trajectories, whereas line spectrum pair (LSP) coefficients and the fundamental frequency (F0) represent the speech acoustics domain. Mathematical models are trained to estimate the acoustic parameters (LSPs + F0) from the motion parameters (2-D marker positions), and the estimated acoustic parameters are then used to synthesize the acoustic speech signal. Cross-domain analysis is performed for both undecomposed (i.e., full head + face) and decomposed (i.e., separated head and face) normalized 2-D motions. Syntheses from each method are being evaluated using intelligibility tests and qualitative comparison of the original and synthesized utterances.
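To make the cross-domain estimation step concrete, the sketch below shows one plausible instance of such a mapping: an affine least-squares estimator from 2-D marker positions to per-frame acoustic parameters (LSPs + F0). This is a minimal illustration, not the authors' model; the frame counts, marker counts, LSP order, and the random placeholder data are all assumptions standing in for frame-synchronized recordings.

```python
# Minimal sketch (not the authors' code) of a cross-domain estimator:
# an affine least-squares map from 2-D marker positions to acoustic
# parameters (LSPs + F0). All dimensions and data are placeholders.
import numpy as np

rng = np.random.default_rng(0)

T = 500          # number of synchronized analysis frames (assumed)
n_markers = 12   # tracked face/head markers (assumed)
lsp_order = 16   # LSP coefficients per frame (assumed)

# Motion domain: x/y positions of each marker per frame -> (T, 2*n_markers)
X = rng.standard_normal((T, 2 * n_markers))

# Acoustic domain: LSP coefficients plus F0 per frame -> (T, lsp_order + 1)
Y = rng.standard_normal((T, lsp_order + 1))

# Augment the motion features with a bias column so the map is affine.
Xa = np.hstack([X, np.ones((T, 1))])

# Train: ordinary least squares, W = argmin ||Xa W - Y||^2.
W, *_ = np.linalg.lstsq(Xa, Y, rcond=None)

# Estimate the acoustic parameter tracks from the motion frames.
Y_hat = Xa @ W

# Per-parameter correlation between original and estimated tracks is one
# common way to score cross-domain estimators (near zero here, since the
# placeholder data are independent by construction).
r = [np.corrcoef(Y[:, j], Y_hat[:, j])[0, 1] for j in range(Y.shape[1])]
print("mean correlation:", np.mean(r))
```

In a real pipeline, the estimated LSPs would be converted back to linear-prediction coefficients and combined with an F0-driven excitation to resynthesize the speech signal, which is the synthesis step the abstract describes.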
