LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading.

Leyuan Qu, Cornelius Weber, Stefan Wermter

doi:10.1109/tnnls.2022.3191677

Open Access

https://doi.org/10.1109/tnnls.2022.3191677

Abstract

The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2, which consists of an encoder-decoder architecture and a location-aware attention mechanism that maps face image sequences directly to mel-scale spectrograms without requiring any human annotations. The proposed LipSound2 model is first pre-trained on ~2400 h of multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve a significant improvement in speech quality and intelligibility compared to previous approaches in both speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify the impact on transferability. Finally, we train a cascaded lip reading (video-to-text) system by fine-tuning a pre-trained speech recognition system on the generated audio and achieve state-of-the-art performance on both English and Chinese benchmark datasets.
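The abstract names a location-aware attention mechanism but gives no details. As a rough illustration only, the following is a minimal pure-Python sketch of one decoder step of generic location-sensitive attention: the previous alignment is convolved into location features, and these enter an additive score together with the decoder state and each encoder frame. It is not the authors' implementation. The function names, the parameter names (`W`, `V`, `U`, `kernels`, `v`), the dimensions, and the random weights are all assumptions made for the example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d_same(signal, kernel):
    """Zero-padded 1-D cross-correlation; output has the same length as signal."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def matvec(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def location_aware_attention(query, memory, prev_weights, params):
    """One attention step (hypothetical helper, not from the paper).

    query:        decoder state, length dec_dim
    memory:       encoder outputs, one vector of length enc_dim per video frame
    prev_weights: attention weights from the previous decoder step, length T
    Returns (context vector, new attention weights).
    """
    # Location features: one convolution channel per kernel over the
    # previous alignment, so the score knows where attention was last.
    loc = [conv1d_same(prev_weights, k) for k in params["kernels"]]
    q_proj = matvec(params["W"], query)
    energies = []
    for t, frame in enumerate(memory):
        h_proj = matvec(params["V"], frame)
        f_proj = matvec(params["U"], [channel[t] for channel in loc])
        energies.append(sum(
            vi * math.tanh(a + b + c)
            for vi, a, b, c in zip(params["v"], q_proj, h_proj, f_proj)
        ))
    weights = softmax(energies)
    context = [
        sum(w * frame[d] for w, frame in zip(weights, memory))
        for d in range(len(memory[0]))
    ]
    return context, weights


def rand_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Toy usage: 6 encoded video frames, all sizes chosen arbitrarily.
rng = random.Random(0)
T, enc_dim, dec_dim, att_dim, channels = 6, 4, 3, 5, 2
memory = rand_matrix(T, enc_dim, rng)
query = [rng.uniform(-0.5, 0.5) for _ in range(dec_dim)]
params = {
    "W": rand_matrix(att_dim, dec_dim, rng),
    "V": rand_matrix(att_dim, enc_dim, rng),
    "U": rand_matrix(att_dim, channels, rng),
    "kernels": rand_matrix(channels, 3, rng),
    "v": [rng.uniform(-0.5, 0.5) for _ in range(att_dim)],
}
prev_weights = [1.0] + [0.0] * (T - 1)  # attention starts on the first frame
context, weights = location_aware_attention(query, memory, prev_weights, params)
```

In a full model, the returned `context` would feed the decoder that predicts the next mel-spectrogram frame, and `weights` would be passed back in as `prev_weights` at the following step.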

Journal: IEEE Transactions on Neural Networks and Learning Systems
Publication Date: Feb 1, 2024
Citations: 11
License type: CC BY 4.0
