Abstract

This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The approach leverages the complementary strengths of deep learning and analytical acoustic modeling (a filtering-based approach), in contrast to benchmark approaches that rely on deep learning alone. The proposed audio-visual (AV) speech enhancement framework operates at two levels. At the first level, a novel deep learning based lip-reading regression model is employed. At the second level, the lip-reading approximated clean-audio features are exploited by an enhanced, visually derived Wiener filter (EVWF) to estimate the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed to estimate the clean audio features from temporal visual features alone (i.e., lip reading), taking a range of prior visual frames into account. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated that exploits the estimated speech features. The EVWF is compared with conventional spectral subtraction and log-minimum mean-square error (log-MMSE) methods, using both ideal AV mapping and LSTM-driven AV mapping. The potential of the proposed AV speech enhancement framework is evaluated under four dynamic real-world scenarios (cafe, street junction, public transport, and pedestrian area) at SNR levels ranging from low to high, using the benchmark Grid and CHiME-3 corpora. For objective testing, the perceptual evaluation of speech quality (PESQ) metric is used to assess the quality of the restored speech. For subjective testing, the standard mean opinion score (MOS) method is used together with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvements in terms of both speech quality and speech intelligibility. Ongoing work aims to enhance the accuracy and generalization capability of the deep learning driven lip-reading model using contextual integration of AV cues, leading to context-aware, autonomous AV speech enhancement.
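To make the second level concrete, a filterbank-domain Wiener gain can be formed from the lip-reading estimated clean-speech power and a noise-power estimate, and then applied to the noisy filterbank energies. The sketch below is a minimal NumPy illustration of this idea only; the variable names (`clean_fb_est`, `noise_fb_est`, `noisy_fb`) and the assumption that a noise estimate is available per frame are illustrative, not the authors' EVWF implementation.

```python
# Minimal NumPy sketch of a filterbank-domain Wiener gain driven by
# visually estimated clean-speech features (illustrative, not the paper's EVWF code).
import numpy as np

def wiener_gain(clean_fb_est, noise_fb_est, eps=1e-10):
    """Per-channel Wiener gain G = S / (S + N), where S is the clean-speech power
    estimated from lip reading and N is an estimate of the noise power."""
    return clean_fb_est / (clean_fb_est + noise_fb_est + eps)

def enhance_frame(noisy_fb, clean_fb_est, noise_fb_est):
    """Attenuate each filterbank channel of a noisy frame by its Wiener gain."""
    return wiener_gain(clean_fb_est, noise_fb_est) * noisy_fb
```

In the full framework, the filterbank-domain gains would then be mapped back to the spectral domain to estimate the clean audio power spectrum before resynthesising the enhanced waveform.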

Highlights

  • Speech enhancement aims to improve the perceived overall quality and intelligibility of speech when noise degrades them significantly

  • Comparative simulation results demonstrate that the proposed EVWF approach outperforms benchmark audio-only approaches at very low SNRs (−12 dB, −6 dB, −3 dB, and 0 dB), and that the improvement is statistically significant at the 95% confidence level

  • The estimated lip-reading driven audio features are exploited by the proposed novel EVWF for speech enhancement


Summary

INTRODUCTION

Speech enhancement aims to improve the perceived overall quality and intelligibility of speech when noise degrades them significantly. Researchers have proposed deep learning based advanced speech recognition [5] and enhancement [6] approaches. Most of these rely on single-channel (audio-only) processing, which often performs poorly in adverse conditions where overwhelming noise is present [7]. We propose a novel deep learning based lip-reading regression model and EVWF for speech enhancement. A stacked LSTM based data-driven model is proposed to approximate the clean audio features using only temporal visual features (i.e., lip reading), as sketched below. Finally, we evaluate the potential of the proposed speech enhancement framework, exploiting both ideal AV mapping (ideal visual-to-audio feature mapping with exact clean audio features) and the designed stacked LSTM based lip-reading model.
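A minimal sketch of such a stacked LSTM regression model is given below, assuming Keras/TensorFlow. The context length, feature dimensions, and layer sizes are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal Keras sketch of a stacked LSTM lip-reading regression model mapping a
# window of prior visual (lip) frames to clean audio filterbank features.
# All dimensions and layer sizes below are illustrative assumptions, not the paper's.
import tensorflow as tf

N_PRIOR_FRAMES = 14   # assumed number of prior visual frames used as context
VISUAL_DIM = 50       # assumed visual feature dimension per frame (e.g. lip-region DCT)
AUDIO_DIM = 23        # assumed clean audio (log-filterbank) feature dimension

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N_PRIOR_FRAMES, VISUAL_DIM)),
    tf.keras.layers.LSTM(250, return_sequences=True),  # first LSTM layer
    tf.keras.layers.LSTM(250),                          # second, stacked LSTM layer
    tf.keras.layers.Dense(AUDIO_DIM),                   # linear regression output
])
model.compile(optimizer="adam", loss="mse")  # mean-squared-error regression objective
```

At test time, the predicted audio features would replace the (unavailable) clean-speech features in the EVWF stage.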

SPEECH ENHANCEMENT FRAMEWORK
Enhanced Visually Derived Wiener Filter
Lip Reading Regression Model
Dataset
Visual Feature Extraction
Methodology
Initial Data Analysis With MLP
Multiple Visual Frames to Audio Mapping Using MLP and LSTM
Speech Enhancement
DISCUSSION AND CONCLUSION
