E2E-V2SResNet: Deep residual convolutional neural networks for end-to-end video driven speech synthesis

Nasir Saleem, Jiechao Gao, Muhammad Irfan, Elena Verdu, Javier Parra Fuente

doi:10.1016/j.imavis.2022.104389

Abstract

Speechreading, which infers a spoken message from visually detected articulatory facial movements, is a challenging task. In this paper, we propose an end-to-end ResNet model (E2E-V2SResNet) for synthesizing speech signals from the silent video of a speaking individual. The model is a convolutional encoder-decoder framework: the encoder takes the video frames and encodes them into a latent space of visual features, and the decoder outputs spectrograms, which are converted into waveforms corresponding to the speech articulated in the input video. The speech waveforms are then fed to a waveform critic, which decides whether the speech is real or synthesized. Experiments show that the proposed E2E-V2SResNet model synthesizes speech that is realistic, intelligible, and of good quality on the GRID database. To further demonstrate the potential of the proposed model, we also conduct experiments on the TCD-TIMIT database. We evaluate the synthesized speech for unseen speakers using three objective metrics that measure its intelligibility, quality, and word error rate (WER). E2E-V2SResNet outperforms the competing approaches on most metrics on the GRID and TCD-TIMIT databases. Compared with the baseline, the proposed model achieves a 3.077% improvement in speech quality and a 2.593% improvement in speech intelligibility.
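
The abstract names word error rate (WER) as one of the three objective metrics but does not say how it is computed. As a minimal sketch, assuming the standard definition (word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words), WER could be computed as follows. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: transcripts are compared case-insensitively on whitespace-split
    words. The paper's actual evaluation pipeline may differ.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative GRID-style command sentence with one misrecognized word.
example_wer = word_error_rate("bin blue at f two now", "bin blue at s two now")
```

Here one of six reference words is substituted, so `example_wer` is 1/6. Lower WER means the synthesized speech is recognized more accurately.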

Journal: Image and Vision Computing	Publication Date: Jan 31, 2022
Citations: 11

Similar Papers

DNN-based Speech Enhancement for Improving Speech Quality and Intelligibility Simultaneously
Ge Zhan ... Wenjing Wei
01 Dec 2019

Enhancement of speech in noise using multi-channel, time-varying gains derived from the temporal envelope
Rahim Soleymanpour ... Insoo Kim
Applied Acoustics | VOL. 190
24 Jan 2022

Inter-patient arrhythmia classification with improved deep residual convolutional neural network
Yuanlu Li ... Kun Li
Computer Methods and Programs in Biomedicine | VOL. 214
12 Dec 2021

Performance analysis of various training targets for improving speech quality and intelligibility
Shoba Sivapatham ... Rajavel Ramadoss
Applied Acoustics | VOL. 175
17 Dec 2020
