Transforming an embodied conversational agent into an efficient talking head: from keyframe-based animation to multimodal concatenation synthesis.

Guillaume Gibert, Kirk N Olsen, Yvonne Leung, Catherine J Stevens

doi:10.1186/s40469-015-0007-8

Abstract

Background: Virtual humans have become part of our everyday life (movies, internet, and computer games). Even though they are becoming more and more realistic, their speech capabilities are, most of the time, limited and not coherent and/or not synchronous with the corresponding acoustic signal.

Methods: We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. Speech articulation cannot be accurately replicated using interpolation between key frames, and talking heads with good speech capabilities are derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for visible speech articulators (jaw and lips) synchronous with acoustics. To access tongue trajectories (the tongue being a partially occluded speech articulator), electromagnetic articulography (EMA) is often used. We recorded a large database of phonetically-balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide ‘normalized’ animation (i.e., articulatory) parameters. In addition, semi-automatic segmentation was performed on the acoustic stream. A dictionary of multimodal Australian English diphones was created. It is composed of the variation of the articulatory parameters between all the successive stable allophones.

Results: The avatar’s facial key frames were converted into articulatory parameters steering its speech articulators (jaw, lips and tongue). The speech production database was used to drive the Embodied Conversational Agent (ECA) and to enhance its speech capabilities. A Text-To-Auditory Visual Speech synthesizer was created based on the MaryTTS software and on the diphone dictionary derived from the speech production database.

Conclusions: We describe a method to transform an ECA with a generic tongue model and animation by key frames into a talking head that displays naturalistic tongue, jaw and lip motions. Thanks to a multimodal speech production database, a Text-To-Auditory Visual Speech synthesizer drives the ECA’s facial movements, enhancing its speech capabilities.

Electronic supplementary material: The online version of this article (doi:10.1186/s40469-015-0007-8) contains supplementary material, which is available to authorized users.
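The concatenation step behind the diphone dictionary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the silence symbol `_`, the one-dimensional toy dictionary, and the rule of averaging the shared boundary frame are all assumptions made for illustration. Each hypothetical unit is taken to run from the stable part of one allophone to the stable part of the next, so consecutive units overlap on one frame.

```python
from typing import Dict, List, Tuple

Frame = List[float]        # one vector of articulatory parameters
Trajectory = List[Frame]   # frames over time
Diphone = Tuple[str, str]


def to_diphones(phones: List[str]) -> List[Diphone]:
    """Pair each phone with its successor, padding with silence ('_')."""
    padded = ["_"] + phones + ["_"]
    return list(zip(padded[:-1], padded[1:]))


def concatenate(units: List[Trajectory]) -> Trajectory:
    """Chain units, averaging the frame shared by two consecutive units."""
    out: Trajectory = []
    for unit in units:
        if not out:
            out.extend(list(frame) for frame in unit)
            continue
        # Assumption: the last frame of the previous unit and the first
        # frame of this unit describe the same stable allophone.
        out[-1] = [(a + b) / 2.0 for a, b in zip(out[-1], unit[0])]
        out.extend(list(frame) for frame in unit[1:])
    return out


def synthesize(phones: List[str],
               dictionary: Dict[Diphone, Trajectory]) -> Trajectory:
    """Look up every diphone of the phone sequence and concatenate them."""
    return concatenate([dictionary[d] for d in to_diphones(phones)])


# Hypothetical one-parameter dictionary (e.g. a jaw-opening value).
TOY_DICTIONARY: Dict[Diphone, Trajectory] = {
    ("_", "b"): [[0.0], [0.1]],
    ("b", "a"): [[0.2], [0.6], [1.0]],
    ("a", "_"): [[0.8], [0.4], [0.0]],
}

if __name__ == "__main__":
    for frame in synthesize(["b", "a"], TOY_DICTIONARY):
        print(frame)
```

Averaging the boundary frame is the simplest possible join; a real system would more likely smooth or cross-fade over several frames and time-align the units with the acoustic segments.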

Highlights

Virtual humans have become part of our everyday life
We propose an innovative method to transform an existing Embodied Conversational Agent (ECA) animated by interpolation between key frames into a talking head
The acoustic signal and the articulatory parameter trajectories are sent to the animation module, which plays the data, i.e., the ECA speaks and moves its speech effectors
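One way an animation module could play an articulatory trajectory in step with the audio is to resample it at the rendering frame rate. The Python sketch below assumes integer rates and linear interpolation between samples; the function names and the rates in the usage note are hypothetical, since the page gives no sampling or frame rates.

```python
from typing import List

Frame = List[float]


def sample_at(trajectory: List[Frame], pos: float) -> Frame:
    """Linearly interpolate the parameter vector at fractional index pos."""
    last = len(trajectory) - 1
    if pos <= 0:
        return list(trajectory[0])
    if pos >= last:
        return list(trajectory[last])
    i = int(pos)
    w = pos - i
    return [(1.0 - w) * a + w * b
            for a, b in zip(trajectory[i], trajectory[i + 1])]


def frames_for_playback(trajectory: List[Frame],
                        rate_hz: int, fps: int) -> List[Frame]:
    """Resample a trajectory recorded at rate_hz to fps rendered frames.

    Rendered frame k is shown at time k / fps seconds, the same clock the
    audio player would use, so frame k reads index k * rate_hz / fps.
    """
    n_frames = (len(trajectory) - 1) * fps // rate_hz + 1
    return [sample_at(trajectory, k * rate_hz / fps) for k in range(n_frames)]
```

For example, `frames_for_playback(traj, 100, 50)` would keep every second sample of a trajectory assumed to be recorded at 100 Hz when rendering at 50 frames per second.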

Summary

Methods

We describe a method to convert a virtual human avatar (animated through key frames and interpolation) into a more naturalistic talking head. Speech articulation cannot be accurately replicated using interpolation between key frames, and talking heads with good speech capabilities are derived from real speech production data. Motion capture data are commonly used to provide accurate facial motion for visible speech articulators (jaw and lips) synchronous with acoustics. We recorded a large database of phonetically-balanced English sentences with synchronous EMA, motion capture data, and acoustics. An articulatory model was computed on this database to recover missing data and to provide ‘normalized’ animation (i.e., articulatory) parameters. A dictionary of multimodal Australian English diphones was created. It is composed of the variation of the articulatory parameters between all the successive stable allophones.
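How an articulatory model can both recover missing data and yield a compact animation parameter is easiest to see on a toy case. The sketch below assumes a linear model with a single component, which the summary does not state; `mean`, `basis`, and all function names are hypothetical. The parameter is fitted by least squares on the coordinates that were observed, and the missing coordinates are then read off the model's reconstruction.

```python
from typing import List, Optional


def fit_parameter(observed: List[Optional[float]],
                  mean: List[float], basis: List[float]) -> float:
    """Least-squares fit of one parameter, using observed coordinates only."""
    num = 0.0
    den = 0.0
    for x, m, b in zip(observed, mean, basis):
        if x is None:          # missing marker or coil coordinate
            continue
        num += b * (x - m)
        den += b * b
    if den == 0.0:
        raise ValueError("no observed coordinate constrains the parameter")
    return num / den


def reconstruct(param: float, mean: List[float],
                basis: List[float]) -> List[float]:
    """Model prediction for every coordinate: mean + param * basis."""
    return [m + param * b for m, b in zip(mean, basis)]


def recover_missing(observed: List[Optional[float]],
                    mean: List[float], basis: List[float]) -> List[float]:
    """Keep observed values and fill the gaps from the fitted model."""
    full = reconstruct(fit_parameter(observed, mean, basis), mean, basis)
    return [f if x is None else x for x, f in zip(observed, full)]
```

With `mean = [0.0, 10.0, 5.0]` and `basis = [1.0, 2.0, -1.0]`, calling `recover_missing([2.0, None, 3.0], mean, basis)` fits a parameter of 2.0 and fills the gap with 14.0. A model with several components would need a full least-squares solve instead of this closed form.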



Journal: Computational Cognitive Science	Publication Date: Sep 8, 2015
Citations: 27	License type: CC BY 4.0


