Abstract

The redundant convolutional encoder-decoder network has proven useful in speech enhancement tasks. This network captures the localized time-frequency details of speech signals through its fully convolutional structure and the feature selection capability arising from the encoder-decoder mechanism. However, it does not explicitly extract informative features, which we regard as important for the representational capability of speech enhancement models. To address this problem, we introduce the attention mechanism into the convolutional encoder-decoder model to explicitly emphasize useful information from three perspectives, namely, channel, space, and concurrent space-and-channel. The attention operation is realized through the squeeze-and-excitation (SE) mechanism and its variants. By assigning weights from these different perspectives according to global information, the model can adaptively emphasize valuable information and suppress useless information, thereby improving its representational capability. Experimental results show that the proposed attention mechanisms use only a small fraction of additional parameters to effectively improve the performance of convolutional neural network (CNN)-based models over their plain versions, and that they generalize well to unseen noises, signal-to-noise ratios (SNRs), and speakers. Among these mechanisms, the concurrent space-and-channel-wise attention yields the most significant improvement. Compared with the state of the art, the proposed models produce comparable or better results. We also integrate the proposed attention mechanisms into other CNN-based models and obtain performance gains. Moreover, we visualize the enhancement results to show the effect of the attention mechanisms more clearly.
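As a rough illustration of the channel-wise variant described above, the following NumPy sketch shows squeeze-and-excitation recalibration of a time-frequency feature map. The function names, shapes, and reduction-ratio convention are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_se(feature_map, w1, w2):
    """Channel-wise squeeze-and-excitation (illustrative sketch).

    feature_map: (C, T, F) array -- channels x time frames x frequency bins.
    w1: (C // r, C) and w2: (C, C // r) -- excitation weights, r = reduction ratio.
    """
    # Squeeze: global average pooling collapses each channel to one scalar,
    # providing the global information used to judge channel importance.
    z = feature_map.mean(axis=(1, 2))                  # shape (C,)
    # Excitation: bottleneck MLP (ReLU) + sigmoid yields per-channel gates in (0, 1).
    gates = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))      # shape (C,)
    # Scale: emphasize valuable channels, suppress less useful ones.
    return feature_map * gates[:, None, None]
```

Because the gates lie in (0, 1), the block can only attenuate channels relative to their input; during training the network learns to keep the gates of informative channels close to 1.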

Highlights

  • Speech enhancement aims to remove background noise from the degraded speech without distorting the clean speech, thereby improving the speech quality and intelligibility

  • Considering that many state-of-the-art convolutional neural network (CNN) speech enhancement methods have shortcuts [24], [25], we investigate the effect of the proposed squeeze-and-excitation (SE) mechanisms when combined with shortcut-based CNNs

  • Attention weight assignment is achieved through the SE mechanism, which improves speech enhancement performance by emphasizing valuable information


Summary

INTRODUCTION

Speech enhancement aims to remove background noise from degraded speech without distorting the clean speech, thereby improving speech quality and intelligibility. In speech enhancement, the use of detailed information is essential for restoring clean speech. To solve this problem, [23] proposed the redundant convolutional encoder-decoder (RCED), which discards the max-pooling layers and the corresponding upsampling layers of the fully convolutional network (FCN) to maintain the feature map size, thereby retaining the details and achieving improved performance. Roy et al. [37] and Woo et al. [26] extended squeeze-and-excitation (SE) to the space and concurrent space-and-channel domains and achieved satisfactory results. Motivated by these works, we introduce SE as the attention mechanism and combine it with RCED, addressing RCED's difficulty in effectively exploiting global information [25] or explicitly judging the importance of different features.
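The two extensions mentioned above can be sketched in NumPy as follows: a space-wise SE that gates each time-frequency position, and a concurrent space-and-channel SE that fuses both recalibrations, here by an element-wise maximum as in one of Roy et al.'s scSE fusion variants. All names and shapes are chosen for illustration, not taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_se(fmap, w1, w2):
    """Channel-wise SE: gate each channel by a scalar in (0, 1)."""
    z = fmap.mean(axis=(1, 2))                     # squeeze: (C,)
    gates = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))  # excite:  (C,)
    return fmap * gates[:, None, None]

def spatial_se(fmap, w_sp):
    """Space-wise SE: a 1x1 projection over channels (weight vector w_sp)
    yields one gate per time-frequency position, shared by all channels."""
    gates = sigmoid(np.tensordot(w_sp, fmap, axes=([0], [0])))  # (T, F)
    return fmap * gates[None, :, :]

def concurrent_se(fmap, w1, w2, w_sp):
    """Concurrent space-and-channel SE: fuse both recalibrations by an
    element-wise maximum, keeping whichever gate passes more signal."""
    return np.maximum(channel_se(fmap, w1, w2), spatial_se(fmap, w_sp))
```

All three operations preserve the feature-map shape, so they can be dropped after any convolutional layer of an RCED-style network without altering the rest of the architecture.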

MODEL DESCRIPTION
SPACE-CHANNEL-WISE SE
EXPERIMENTAL CONFIGURATION
Findings
CONCLUSION