Abstract

This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The approach leverages the complementary strengths of deep learning and analytical acoustic modeling (a filtering-based approach), in contrast to benchmark approaches that rely on deep learning alone. The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first level, a novel deep learning based lip-reading regression model is employed. In the second level, the clean audio features approximated by lip reading are exploited by an enhanced, visually derived Wiener filter (EVWF) to estimate the clean audio power spectrum. Specifically, a stacked long short-term memory (LSTM) based lip-reading regression model is designed to estimate the clean audio features using only temporal visual features (i.e., lip reading), considering a range of prior visual frames. For clean speech spectrum estimation, a new filterbank-domain EVWF is formulated, which exploits the estimated speech features. The EVWF is compared with conventional spectral subtraction and log-minimum mean-square error methods using both ideal AV mapping and LSTM-driven AV mapping. The potential of the proposed AV speech enhancement framework is evaluated under four dynamic real-world scenarios (cafe, street junction, public transport, and pedestrian area) at SNR levels ranging from low to high, using the benchmark Grid and CHiME3 corpora. For objective testing, the perceptual evaluation of speech quality (PESQ) metric is used to assess the quality of the restored speech. For subjective testing, the standard mean opinion score (MOS) method is used together with inferential statistics. Comparative simulation results demonstrate significant lip-reading and speech enhancement improvements in terms of both speech quality and speech intelligibility. Ongoing work aims to enhance the accuracy and generalization capability of the deep learning driven lip-reading model through contextual integration of AV cues, leading to context-aware, autonomous AV speech enhancement.
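As a minimal, illustrative sketch of the first level, the model below maps a window of prior visual (lip) feature frames to clean audio filterbank features with a stacked LSTM, as the abstract describes. All dimensions and names here (visual feature size, number of prior frames, number of filterbank channels, hidden units) are assumptions chosen for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LipReadingLSTM(nn.Module):
    """Stacked LSTM regressing clean audio filterbank features from a
    sequence of prior visual (lip) feature frames.
    All dimensions are illustrative assumptions, not the paper's setup."""

    def __init__(self, visual_dim=50, hidden_dim=250, audio_dim=23, num_layers=2):
        super().__init__()
        # Stacked LSTM layers over the temporal visual sequence.
        self.lstm = nn.LSTM(visual_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        # Linear readout maps the final hidden state to audio features.
        self.readout = nn.Linear(hidden_dim, audio_dim)

    def forward(self, visual_frames):
        # visual_frames: (batch, num_prior_frames, visual_dim)
        out, _ = self.lstm(visual_frames)
        # Predict the current audio frame from the last time step.
        return self.readout(out[:, -1, :])

# Hypothetical usage: 18 prior visual frames of 50-D lip features
# regressed onto 23 filterbank audio features, trained with MSE loss.
model = LipReadingLSTM()
x = torch.randn(8, 18, 50)                    # (batch, frames, visual_dim)
y_hat = model(x)                              # (batch, 23)
loss = nn.MSELoss()(y_hat, torch.randn(8, 23))
```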

Highlights

  • Speech enhancement aims to improve the perceived overall quality and intelligibility of speech when noise degrades them significantly

  • Comparative simulation results demonstrate that the proposed EVWF approach outperforms benchmark audio-only approaches at very low SNRs (−12 dB, −6 dB, −3 dB, and 0 dB), and that the improvement is statistically significant at the 95% confidence level

  • The estimated lip-reading driven audio features are exploited by the proposed novel EVWF for speech enhancement


Summary

INTRODUCTION

Speech enhancement aims to improve the perceived overall quality and intelligibility of speech when noise degrades them significantly. Researchers have proposed advanced deep learning based speech recognition [5] and enhancement [6] approaches. Most of these rely on single-channel (audio-only) processing, which often performs poorly in adverse conditions where overwhelming noise is present [7]. We propose a novel deep learning based lip-reading regression model and an EVWF for speech enhancement. A stacked LSTM based data-driven model is proposed to approximate the clean audio features using only temporal visual features (i.e., lip reading). The potential of the proposed speech enhancement framework is evaluated by exploiting both ideal AV mapping (ideal visual-to-audio feature mapping with exact clean audio features) and the designed stacked LSTM based lip-reading model.
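The following is a minimal sketch of the second-level EVWF idea: the LSTM-estimated clean filterbank power and the observed noisy filterbank power yield a per-channel Wiener gain, which is spread back to FFT bins and applied to the noisy spectrum. The noise-power estimate (noisy minus estimated clean, floored at zero) and the bin-interpolation scheme are simplifying assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def evwf_gain(clean_fb_power, noisy_fb_power, eps=1e-10):
    """Per-channel Wiener gain in the filterbank domain.

    clean_fb_power : clean-speech filterbank power estimated by the
                     visually driven LSTM regression model.
    noisy_fb_power : filterbank power of the observed noisy speech.
    The noise power is approximated as max(noisy - clean, 0); this is a
    simplifying assumption, not necessarily the paper's estimator.
    """
    noise_fb_power = np.maximum(noisy_fb_power - clean_fb_power, 0.0)
    return clean_fb_power / (clean_fb_power + noise_fb_power + eps)

def enhance_frame(noisy_stft_frame, gain_fb, fb_matrix):
    """Apply the filterbank-domain gain to one STFT frame.

    fb_matrix : (num_channels, num_bins) triangular filterbank; the gain
    is interpolated to FFT bins by normalized weighting (an assumption).
    """
    # Normalize columns so each bin's gain is a convex combination
    # of the channel gains covering that bin.
    weights = fb_matrix / (fb_matrix.sum(axis=0, keepdims=True) + 1e-10)
    gain_bins = weights.T @ gain_fb       # (num_bins,)
    return noisy_stft_frame * gain_bins   # enhanced spectrum for this frame
```

In this sketch the enhanced spectrum would then be inverted frame by frame (e.g., via an inverse STFT with overlap-add) to reconstruct the time-domain speech signal.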

SPEECH ENHANCEMENT FRAMEWORK
Enhanced Visually Derived Wiener Filter
Lip Reading Regression Model
Dataset
Visual Feature Extraction
Methodology
Initial Data Analysis With MLP
Multiple Visual Frames to Audio Mapping Using MLP and LSTM
Speech Enhancement
DISCUSSION AND CONCLUSION

