Abstract

In recent years, deep learning models have been widely employed for speech enhancement. Most existing deep learning methods use fully convolutional neural networks (CNNs) to capture the time–frequency information of input features. Compared with CNNs, Long Short-Term Memory (LSTM) networks are better suited to capturing contextual information along the time axis of the features; however, the computational load of a fully LSTM structure is heavy. To balance model complexity against the ability to capture time–frequency features, we present an LSTM-Convolutional-BLSTM Encoder-Decoder (LCLED) network for speech enhancement, which additionally incorporates transposed convolutions and skip connections. The key idea is to use two LSTM parts and convolutional layers to model the contextual information and the frequency-dimension features, respectively. Furthermore, to achieve higher quality in the enhanced speech, the a priori Signal-to-Noise Ratio (SNR) is used as the learning target of the LCLED, and a Minimum Mean-Square Error (MMSE) approach is applied for postprocessing. The results indicate that, compared with a fully LSTM structure, the proposed LCLED not only reduces model complexity and training time but also improves the quality and intelligibility of the enhanced speech.
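
The abstract does not give layer configurations, so the following is a minimal PyTorch sketch of the described layout, under stated assumptions: log-magnitude spectral inputs with 161 frequency bins, and arbitrary channel counts, kernel sizes, and activations are all illustrative choices, not the paper's. Only the overall structure follows the abstract: an LSTM front end for temporal context, a strided-convolution encoder over the frequency axis, a transposed-convolution decoder with a skip connection, a BLSTM back end, and a non-negative a priori SNR output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LCLED(nn.Module):
    """Illustrative sketch of an LSTM-Convolutional-BLSTM encoder-decoder."""

    def __init__(self, n_freq=161):
        super().__init__()
        # LSTM front end: models contextual information along the time axis.
        self.lstm_in = nn.LSTM(n_freq, n_freq, batch_first=True)
        # Convolutional encoder: 1x3 convs with stride 2 on the frequency
        # axis model frequency-dimension features while downsampling them.
        self.enc1 = nn.Conv2d(1, 16, (1, 3), stride=(1, 2), padding=(0, 1))
        self.enc2 = nn.Conv2d(16, 32, (1, 3), stride=(1, 2), padding=(0, 1))
        # Transposed-convolution decoder restores the frequency resolution;
        # dec1 takes doubled channels because of the concatenated skip.
        self.dec2 = nn.ConvTranspose2d(32, 16, (1, 3), stride=(1, 2), padding=(0, 1))
        self.dec1 = nn.ConvTranspose2d(32, 1, (1, 3), stride=(1, 2), padding=(0, 1))
        # BLSTM back end plus a projection to one output per T-F bin.
        self.blstm = nn.LSTM(n_freq, n_freq, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * n_freq, n_freq)

    def forward(self, x):            # x: (batch, time, freq) noisy features
        h, _ = self.lstm_in(x)
        h = h.unsqueeze(1)           # -> (batch, 1, time, freq) for Conv2d
        e1 = F.elu(self.enc1(h))
        e2 = F.elu(self.enc2(e1))
        d2 = F.elu(self.dec2(e2))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))   # skip connection
        h, _ = self.blstm(d1.squeeze(1))
        # Softplus keeps the estimated a priori SNR non-negative.
        return F.softplus(self.proj(h))


net = LCLED(n_freq=161)
snr_hat = net(torch.randn(4, 200, 161))  # 4 utterances, 200 frames, 161 bins
# A Wiener-type gain is one standard MMSE-style use of the a priori SNR
# (the paper's exact MMSE postprocessing may differ).
gain = snr_hat / (snr_hat + 1.0)
print(snr_hat.shape, gain.shape)
```

With this layout, the LSTM parts carry the temporal modeling while the strided and transposed convolutions handle the frequency dimension, which is the complexity trade-off the abstract describes.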
