Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments

Ahsan Adeel, Mandar Gogate, Amir Hussain

doi:10.1016/j.inffus.2019.08.008


Open Access

https://doi.org/10.1016/j.inffus.2019.08.008


Abstract

Human speech processing is inherently multi-modal: visual cues (e.g. lip movements) help listeners understand speech in noise. Our recent work [1] has shown that lip-reading driven, audio-visual (AV) speech enhancement can significantly outperform benchmark audio-only approaches at low signal-to-noise ratios (SNRs). However, consistent with our cognitive hypothesis, visual cues were found to be relatively less effective for speech enhancement at high SNRs, i.e. at low levels of background noise, where audio-only (A-only) cues worked adequately. A more cognitively-inspired, context-aware AV approach is therefore required, one that contextually utilises both visual and noisy audio features and thus accounts more effectively for different noisy conditions. In this paper, we introduce a novel context-aware AV speech enhancement framework that contextually exploits AV cues with respect to different operating conditions in order to estimate clean audio, without requiring any prior SNR estimation. In particular, an AV switching module is developed by integrating a convolutional neural network (CNN) and a long short-term memory (LSTM) network; it learns to contextually switch between visual-only (V-only), A-only and both AV cues at low, high and moderate SNR levels, respectively. For testing, the estimated clean audio features are passed to an innovative, enhanced visually-derived Wiener filter (EVWF) that filters the noisy speech. The context-aware AV speech enhancement framework is evaluated in dynamic real-world scenarios (including cafe, street, bus, and pedestrians) at different SNR levels (ranging from low to high SNRs), using the benchmark Grid and ChiME3 corpora. For objective testing, perceptual evaluation of speech quality (PESQ) is used to evaluate the quality of the restored speech. For subjective testing, the standard mean-opinion-score (MOS) method is used.

Comparative experimental results show the superior performance of our proposed context-aware AV approach over A-only, V-only, spectral subtraction (SS), and log-minimum mean square error (LMMSE) based speech enhancement methods, at both low and high SNRs. The preliminary findings demonstrate the capability of our novel approach to deal with spectro-temporal variations in real-world noisy environments by contextually exploiting the complementary strengths of audio and visual cues. In conclusion, our contextual deep learning-driven AV framework is posited as a benchmark resource for the multi-modal speech processing and machine learning communities.
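
The Wiener-filtering step mentioned in the abstract can be illustrated with a generic, textbook-style sketch. This is not the paper's EVWF, whose details are not given here; it only shows how an estimated clean power spectrum could be turned into a per-bin gain and applied to a noisy spectrum. The function names `wiener_gain` and `enhance_frame`, the `floor` parameter, and the assumption that speech and noise are uncorrelated (so noisy power is roughly clean power plus noise power) are all illustrative choices, not taken from the source.

```python
import numpy as np


def wiener_gain(clean_power_est, noisy_power, floor=1e-10):
    """Per-bin Wiener-style gain G = P_clean / P_noisy, clipped to [0, 1].

    Assumes (hypothetically) that speech and noise are uncorrelated, so that
    P_noisy is approximately P_clean + P_noise and the classic Wiener gain
    P_clean / (P_clean + P_noise) reduces to P_clean / P_noisy.
    """
    clean = np.maximum(np.asarray(clean_power_est, dtype=float), 0.0)
    # Floor the denominator to avoid division by zero in silent bins.
    noisy = np.maximum(np.asarray(noisy_power, dtype=float), floor)
    return np.clip(clean / noisy, 0.0, 1.0)


def enhance_frame(noisy_spectrum, clean_power_est):
    """Apply the gain to one complex spectral frame, keeping the noisy phase."""
    noisy_spectrum = np.asarray(noisy_spectrum, dtype=complex)
    gain = wiener_gain(clean_power_est, np.abs(noisy_spectrum) ** 2)
    return gain * noisy_spectrum
```

In a pipeline of the kind the abstract describes, `clean_power_est` would come from the learned model (driven by audio, visual, or both kinds of cue), while `noisy_spectrum` would be a short-time Fourier transform frame of the noisy recording. For instance, `wiener_gain([1.0, 0.0, 4.0], [2.0, 1.0, 2.0])` yields gains of 0.5, 0.0 and 1.0: the last bin is clipped because the estimate exceeds the observed noisy power.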


Journal: Information Fusion	Publication Date: Aug 19, 2019
Citations: 59	License type: cc-by-nc-nd

Similar Papers

Lip-Reading Driven Deep Learning Approach for Speech Enhancement
Ahsan Adeel ... William M Whitmer
IEEE Transactions on Emerging Topics in Computational Intelligence | VOL. 5
01 Jun 2021

TDS-1 GNSS reflectometry wind geophysical model function response to GPS block types
Fade Chen ... Mohamed Freeshah
Geo-spatial Information Science | VOL. 25
09 Jan 2022

Speech Denoising via Low-Rank and Sparse Matrix Decomposition
Jianjun Huang
ETRI Journal | VOL. 36
01 Feb 2014

Mitigating 802.11 Mac Anomaly Using SNR To Control Backoff Contention Window
O C Branquinho ... D M Ferreira
01 Jul 2006
