Visual Speech Synthesis Using Dynamic Visemes, Contextual Features and DNNs

Ausdang Thangthai, Sarah Taylor, Ben Milner

doi:10.21437/interspeech.2016-1084

Abstract

This paper examines methods to improve visual speech synthesis from text input using a deep neural network (DNN). Two representations of the input text are considered: phoneme sequences and dynamic viseme sequences. From these sequences, contextual features are extracted that carry information at several linguistic levels, from the frame level up to the utterance level. The features are extracted over a broad sliding window that captures context, and are input to the DNN to estimate visual features. Experiments first compare the accuracy of these visual features against an HMM baseline, establishing that both the phoneme and dynamic viseme systems outperform the baseline, with the best performance obtained by a combined phoneme-dynamic viseme system. An investigation into the features then reveals the importance of the frame-level information, which avoids discontinuities in the visual feature sequence and produces smooth and realistic output.
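
As a rough illustration of the sliding-window idea described in the abstract, the sketch below stacks each frame's feature vector with those of its neighbours to form one wider input vector per frame. It is only a minimal sketch under stated assumptions: the function name `stack_context`, the window widths, the edge padding by frame repetition, and the toy feature values are all hypothetical and are not taken from the paper, whose actual contextual features span several linguistic levels.

```python
from typing import List, Sequence


def stack_context(
    frames: Sequence[Sequence[float]], left: int, right: int
) -> List[List[float]]:
    """Concatenate each frame with its neighbours inside a sliding window.

    Hypothetical helper: `left` and `right` are the number of context
    frames on each side. Edges are padded by repeating the first or last
    frame, which is an assumption made here for simplicity.
    """
    if not frames:
        return []
    n = len(frames)
    stacked: List[List[float]] = []
    for t in range(n):
        window: List[float] = []
        for offset in range(-left, right + 1):
            # Clamp the index so the window never leaves the sequence.
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy input: three frames with two (made-up) features each.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
windowed = stack_context(frames, left=1, right=1)
```

Each row of `windowed` has `(left + 1 + right)` times the original feature dimension and could serve as one input vector to a frame-wise regression network. A wider window gives the network more context at the cost of a larger input layer.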

Publication Date: Sep 8, 2016
Citations: 2
License type: MIT
