Abstract

The auditory selection framework with attention and memory (ASAM), which comprises an attention mechanism, an embedding generator, a generated embedding array, and a life-long memory, is used to deal with mixed speech. When ASAM is applied to speech enhancement, the discrepancy between the voice and noise feature memories is large, which increases the separability of noise and voice. However, ASAM cannot achieve desirable speech enhancement performance because it fails to exploit the time-frequency dependence of the embedding vectors when generating the corresponding mask unit. This work proposes a novel embedding encoder-decoder (EED) in which a convolutional neural network (CNN) serves as the decoder. The CNN structure is good at detecting local patterns, which can be exploited to extract correlated embedding information from the embedding array and generate the target spectrogram. This work evaluates a similar ASAM, an EED with an LSTM encoder and a CNN decoder (RC-EED), RC-EED with an attention mechanism (RC-AEED), other similar EED structures, and baseline models. Experimental results show that the RC-EED and RC-AEED networks perform well on the speech enhancement task under low signal-to-noise ratio conditions. In addition, RC-AEED exhibits superior speech enhancement performance over ASAM and achieves better speech quality than the deep recurrent network and the convolutional recurrent network.
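
The source does not include code; the following PyTorch sketch is a minimal illustration of the RC-EED idea described above, in which an LSTM encoder turns noisy spectrogram frames into an embedding array and a small CNN decoder exploits local patterns in that array to produce a mask. All class and parameter names (RCEED, n_freq, embed_dim) and layer sizes are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical RC-EED-style network: LSTM encoder -> embedding array ->
# CNN decoder -> mask applied to the noisy magnitude spectrogram.
# Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class RCEED(nn.Module):
    def __init__(self, n_freq=129, embed_dim=40):
        super().__init__()
        # LSTM encoder: maps each noisy spectrogram frame to an
        # embedding vector, producing an embedding array over time.
        self.encoder = nn.LSTM(n_freq, embed_dim, batch_first=True)
        # CNN decoder: 2-D convolutions over the (time, embedding)
        # array detect local patterns in the embedding array.
        self.decoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )
        self.project = nn.Linear(embed_dim, n_freq)

    def forward(self, noisy_mag):            # (batch, time, n_freq)
        emb, _ = self.encoder(noisy_mag)     # (batch, time, embed_dim)
        x = emb.unsqueeze(1)                 # (batch, 1, time, embed_dim)
        x = self.decoder(x).squeeze(1)       # (batch, time, embed_dim)
        mask = torch.sigmoid(self.project(x))  # mask values in [0, 1]
        return mask * noisy_mag              # enhanced magnitude spectrogram
```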

Highlights

  • Speech enhancement has attracted considerable research attention for several decades

  • Notable progress has been achieved in speech enhancement owing to the introduction of deep learning approaches, which outperform conventional methods such as spectral subtraction [1] and the Wiener filter method [2] that rely on a stationary-noise assumption (a minimal spectral-subtraction sketch follows this list)

  • The results show that the recurrent-convolutional embedding encoder-decoder network (RC-EED) and RC-EED combined with an attention mechanism (RC-AEED) perform well on the speech enhancement task in low signal-to-noise ratio (SNR) conditions, and both clearly outperform a similar ASAM model under different noise conditions
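
As a point of reference for the conventional baselines cited above, here is a minimal NumPy sketch of magnitude spectral subtraction. The noise-only leading frames and the flooring constant are illustrative assumptions, not details taken from [1].

```python
# Minimal spectral-subtraction sketch, assuming stationary noise whose
# magnitude spectrum can be estimated from noise-only leading frames.
import numpy as np

def spectral_subtraction(noisy_stft, n_noise_frames=10, floor=0.01):
    """noisy_stft: complex STFT, shape (n_freq, n_frames)."""
    mag, phase = np.abs(noisy_stft), np.angle(noisy_stft)
    # Estimate the noise magnitude from the first frames (assumed noise-only).
    noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # Subtract the noise estimate; flooring avoids negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return clean_mag * np.exp(1j * phase)   # reuse the noisy phase
```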


Summary

INTRODUCTION

Speech enhancement has attracted considerable research attention for several decades. [5] and [6] introduced the fully convolutional network (FCN) structure to speech enhancement and found that the proposed convolutional encoder-decoder (CED) can outperform structures built from RNNs and fully connected networks. This study proposes applying a unified auditory selection framework that models attention and memory [13], together with an embedding decoder, to speech enhancement; the resulting model combines an LSTM encoder with a CNN decoder. The results show that the RC-EED and RC-EED combined with an attention mechanism (RC-AEED) perform well on the speech enhancement task in low SNR conditions, and both clearly outperform a similar ASAM model under different noise conditions.
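
Since RC-AEED's distinguishing feature is the attention mechanism applied to the embedding array, the sketch below illustrates one plausible form of such a step: frames of the embedding array are reweighted by their similarity to a target (memory) vector before decoding. The dot-product scoring and the function name attend are illustrative assumptions, not necessarily the paper's exact mechanism.

```python
# Hypothetical attention step for an RC-AEED-style model: reweight the
# embedding array by similarity to a memory vector before the CNN decoder.
import torch
import torch.nn.functional as F

def attend(embeddings, memory):
    # embeddings: (batch, time, embed_dim); memory: (batch, embed_dim)
    scores = torch.einsum('btd,bd->bt', embeddings, memory)  # per-frame scores
    weights = F.softmax(scores, dim=-1).unsqueeze(-1)        # (batch, time, 1)
    return embeddings * weights      # attention-weighted embedding array
```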

ALGORITHM DESCRIPTION
EMBEDDING ENCODER
EXPERIMENT SETUP
CONCLUSION