Abstract

We describe the control, shape, and appearance models built with an original photogrammetric method to capture speaker-specific characteristics of facial articulation, anatomy, and texture. Two original contributions are put forward: a trainable trajectory formation model that predicts the articulatory trajectories of a talking face from phonetic input, and a texture model that computes a texture for each 3D facial shape according to articulation. Using motion-capture data from several speakers and module-specific evaluation procedures, we show that this cloning system restores both detailed idiosyncrasies and the global coherence of visible articulation. Results of a subjective evaluation of the complete system against competing trajectory formation models are also presented and discussed.
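The abstract names two components without detailing their implementation here: a trajectory formation model that maps phonetic input to articulatory trajectories, and a texture model that maps articulation to a facial texture. Purely as an illustrative sketch, and not the paper's method, the following Python/NumPy snippet assumes phoneme-indexed articulatory targets smoothed over time for the first component and a linear articulation-to-texture mapping for the second; every name and value in it is hypothetical.

```python
import numpy as np

# Hypothetical per-phoneme articulatory targets (e.g., jaw opening, lip
# protrusion, lip aperture). Values are illustrative, not from the paper.
PHONE_TARGETS = {
    "sil": np.array([0.1, 0.3, 0.1]),
    "p":   np.array([0.2, 0.5, 0.0]),
    "a":   np.array([0.9, 0.2, 0.8]),
    "i":   np.array([0.3, 0.1, 0.4]),
}

def articulatory_trajectory(phones, durations_ms, frame_ms=10.0, smooth=5):
    """Toy trajectory formation: hold each phone's target for its duration,
    then smooth the piecewise-constant sequence to mimic coarticulation."""
    frames = []
    for phone, dur in zip(phones, durations_ms):
        frames += [PHONE_TARGETS[phone]] * max(1, int(round(dur / frame_ms)))
    traj = np.asarray(frames)
    kernel = np.ones(smooth) / smooth
    return np.column_stack(
        [np.convolve(traj[:, d], kernel, mode="same") for d in range(traj.shape[1])]
    )

# Toy articulation-driven texture model: a mean texture plus a linear basis
# weighted by the articulatory parameters (a stand-in for a model trained on
# registered video textures).
N_PIXELS = 64 * 64 * 3
rng = np.random.default_rng(0)
MEAN_TEXTURE = rng.random(N_PIXELS)
TEXTURE_BASIS = 0.05 * rng.standard_normal((N_PIXELS, 3))

def texture_for_frame(articulation):
    """Compute a texture vector for one articulatory configuration."""
    return np.clip(MEAN_TEXTURE + TEXTURE_BASIS @ articulation, 0.0, 1.0)

if __name__ == "__main__":
    traj = articulatory_trajectory(["sil", "p", "a", "i", "sil"],
                                   [100, 60, 150, 140, 100])
    textures = np.stack([texture_for_frame(f) for f in traj])
    print(traj.shape, textures.shape)  # (frames, 3) (frames, 64*64*3)
```

In the paper itself, both components are instead learned from speaker-specific motion-capture and photogrammetric data rather than hand-set targets and a random basis.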

Highlights

  • Embodied conversational agents (ECAs)—virtual characters as well as anthropoid robots—should be able to talk with their human interlocutors

  • This minimal information can be enriched with details of the underlying phonological and informational structure of the message, with facial expressions, or with paralinguistic information, all of which have an impact on speech articulation

  • The acceptability and believability of these ECAs depend on at least three factors: (a) information-dependent factors, which relate to the relevance of the linguistic content and the paralinguistic settings of the messages; (b) the appropriate choice of voice quality, communicative and emotional facial expressions, gaze patterns, and so forth, adapted to the situation and environmental conditions; and (c) signal-dependent factors, which relate to the quality with which this information is rendered by multimodal signals

Summary

Introduction

Embodied conversational agents (ECAs), whether virtual characters or anthropoid robots, should be able to talk with their human interlocutors. The acceptability and believability of these ECAs depend on at least three factors: (a) information-dependent factors, which relate to the relevance of the linguistic content and the paralinguistic settings of the messages; (b) the appropriate choice of voice quality, communicative and emotional facial expressions, gaze patterns, and so forth, adapted to the situation and environmental conditions; and (c) signal-dependent factors, which relate to the quality with which this information is rendered by multimodal signals. This signal-dependent contribution depends in turn on two main factors: the intrinsic quality of each communicative channel (synthesized speech, gaze, facial expressions, head movements, hand gestures) and the quality of inter-channel coherence, that is, the proper coordination of the audible and visible behavior of the recruited organs, which enables the intuitive perceptual fusion of these multimodal streams into a single, coherent communication flow. We will notably show that the proposed statistical control model for audiovisual synchronization compares favorably with the alternative of concatenating multimodal speech segments.
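The last sentence contrasts the proposed statistical control model with concatenation of multimodal speech segments. Purely to illustrate what the concatenative alternative involves, and not as the paper's implementation, the sketch below assumes a store of diphone-indexed articulatory snippets that are chained and cross-faded at the joins; all identifiers and data are placeholders.

```python
import numpy as np

# Toy multimodal segment store: each diphone maps to a short snippet of
# articulatory frames (random placeholders here; a real system would store
# motion-capture and audio jointly). Everything is illustrative.
rng = np.random.default_rng(0)
SEGMENT_STORE = {
    ("sil", "b"): rng.random((8, 3)),
    ("b", "o"):   rng.random((12, 3)),
    ("o", "n"):   rng.random((15, 3)),
    ("n", "sil"): rng.random((10, 3)),
}

def concatenate_segments(phones, overlap=3):
    """Assemble a trajectory by chaining stored diphone segments.

    Adjacent segments are cross-faded over a few frames to limit visible
    discontinuities at the joins, which is the main weakness a statistical
    control model avoids by generating globally smooth trajectories.
    """
    out = None
    for left, right in zip(phones[:-1], phones[1:]):
        seg = SEGMENT_STORE[(left, right)]
        if out is None:
            out = seg.copy()
            continue
        # Linear cross-fade between the tail of `out` and the head of `seg`.
        w = np.linspace(0.0, 1.0, overlap)[:, None]
        out[-overlap:] = (1 - w) * out[-overlap:] + w * seg[:overlap]
        out = np.vstack([out, seg[overlap:]])
    return out

if __name__ == "__main__":
    traj = concatenate_segments(["sil", "b", "o", "n", "sil"])
    print(traj.shape)  # (frames, articulatory parameters)
```

Cross-fading only patches local discontinuities at the joins, whereas a trajectory formation model generates the whole parameter sequence at once; this is the kind of comparison the subjective evaluation addresses.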

State of the Art
Cloning Speakers
The Trajectory Formation System
The Photorealistic Appearance Model
Subjective Evaluation
Conclusions