Abstract

Emotive audio-visual avatars are virtual computer agents with the potential to significantly improve the quality of human-machine interaction and human-human communication. However, the understanding of human communication has not yet advanced to the point where it is possible to build realistic avatars that interact with natural-sounding emotive speech and realistic-looking emotional facial expressions. In this paper, we propose the technical approaches of a novel multimodal framework leading to a text-driven emotive audio-visual avatar. Our primary work focuses on emotive speech synthesis, realistic emotional facial expression animation, and the co-articulation between speech gestures (i.e., lip movements) and facial expressions. A general framework for emotive text-to-speech (TTS) synthesis using a diphone synthesizer is designed and integrated into a generic 3-D avatar face model, and under the guidance of this framework we developed a realistic 3-D avatar prototype. A rule-based emotive TTS synthesis module based on the Festival-MBROLA architecture has been implemented to demonstrate the effectiveness of the framework design. Subjective listening experiments were carried out to evaluate the expressiveness of the synthetic talking avatar.
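
To make the rule-based Festival-MBROLA pipeline concrete, the following is a minimal sketch of how an emotive prosody layer over MBROLA diphone synthesis might look. The emotion rules (pitch and duration scale factors), the sample phoneme sequence, and the `us1` voice path are illustrative assumptions, not values taken from the paper; only the MBROLA .pho file format and command-line invocation are standard.

```python
# Sketch of a rule-based emotive prosody layer over MBROLA diphone synthesis.
# Emotion rules and phoneme values below are illustrative assumptions.
import subprocess

# Hypothetical prosody rules: (pitch scale, duration scale) per emotion.
EMOTION_RULES = {
    "neutral": (1.00, 1.00),
    "happy":   (1.15, 0.90),  # higher pitch, faster speech rate
    "sad":     (0.90, 1.20),  # lower pitch, slower speech rate
}

def apply_emotion(phones, emotion):
    """Scale durations and pitch targets of (phoneme, dur_ms, [(pos%, hz), ...]) tuples."""
    pitch_k, dur_k = EMOTION_RULES[emotion]
    return [
        (ph, round(dur * dur_k), [(pos, round(hz * pitch_k)) for pos, hz in targets])
        for ph, dur, targets in phones
    ]

def to_pho(phones):
    """Serialize to the MBROLA .pho format: 'phoneme dur [pos hz]*' per line."""
    lines = []
    for ph, dur, targets in phones:
        fields = [ph, str(dur)] + [f"{pos} {hz}" for pos, hz in targets]
        lines.append(" ".join(fields))
    return "\n".join(lines) + "\n"

# Example: SAMPA phonemes for "hello" with baseline durations (ms) and pitch
# targets (% of phoneme, Hz); illustrative values, not real Festival output.
phones = [
    ("_",  100, []),
    ("h",   60, [(50, 120)]),
    ("@",   80, [(50, 125)]),
    ("l",   70, [(50, 130)]),
    ("oU", 180, [(20, 135), (80, 110)]),
    ("_",  100, []),
]

with open("hello.pho", "w") as f:
    f.write(to_pho(apply_emotion(phones, "happy")))

# Render with an MBROLA voice database (assumes mbrola and the us1 voice
# are installed locally; the path is an assumption for this example).
subprocess.run(["mbrola", "us1/us1", "hello.pho", "hello.wav"], check=True)
```

In a full pipeline of the kind the paper describes, Festival would supply the phoneme sequence, baseline durations, and pitch contour from input text, and the emotion rules would be applied before handing the .pho stream to MBROLA for waveform generation.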
