Abstract

Speech is the most important means of human communication, and separating a target voice from mixed sound signals is a crucial task. This paper proposes a speech separation model based on a convolutional neural network and an attention mechanism. The magnitude spectrum of the mixed speech signal, which serves as the model input, is high-dimensional. Analyzing the characteristics of the two components shows that the convolutional neural network can effectively extract low-dimensional features and mine the spatiotemporal structure of the speech signal, while the attention mechanism can reduce the loss of sequence information. Combining the two mechanisms therefore improves separation accuracy. Compared with the typical speech separation model DRNN-2 + discrim, the proposed method achieves a 0.27 dB GNSDR gain and a 0.51 dB GSIR gain, which shows that it attains an ideal separation effect.
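The abstract takes the magnitude spectrum of the mixed signal as the model input. A minimal sketch of how such a magnitude spectrum can be computed from a waveform follows; the frame length, hop size, and sampling rate are hypothetical choices for illustration, since this summary does not give the paper's exact STFT settings:

```python
import numpy as np

def magnitude_spectrum(signal, frame_len=512, hop=256):
    """Compute the STFT magnitude spectrum of a 1-D signal.

    frame_len and hop are hypothetical values, not the paper's settings.
    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping windowed frames
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))

# Example: a 1-second two-tone mixture sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
mixture = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)
mag = magnitude_spectrum(mixture)
print(mag.shape)  # (30, 257)
```

The resulting time-frequency matrix is the kind of high-dimensional input that the CNN front end is meant to compress into low-dimensional features.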

Highlights

  • Voice information plays an increasingly important role in our lives, and voice communication is becoming more and more frequent, such as using chat software to send voice messages, using voice to control mobile phone applications, making phone calls, recognizing the singers of songs [1], and identifying singer information, lyrics, and song style [2, 3]. The goal of speech separation is to separate mixed speech into the two original speech signals.

  • Compared with traditional speech methods, deep neural network-based speech separation models offer many advantages. The main contribution of this paper is to apply the convolutional neural network to the speech separation task: using the multilayer nonlinear processing structure of the convolutional neural network to mine the structural information in the speech signal, automatically extracting abstract features, integrating the attention mechanism to reduce the loss of sequence information, and achieving monaural speech separation.

  • The attention layer improves the intelligibility and the perceived quality of the separated speech. The magnitude spectrum information is used as the input of the speech separation model; the magnitude spectrum is processed by the convolutional neural network, and the region of interest in the speech is extracted by the attention module.
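The CNN-plus-attention pipeline in the highlights can be illustrated with a minimal dot-product attention sketch over frame-level features. The feature dimensions and the query vector here are hypothetical stand-ins for the model's learned components, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(features, query):
    """Score each time frame against a query vector.

    features: (T, D) frame-level feature map, e.g. from a CNN
    query:    (D,)   query vector (hypothetical learned parameter)
    Returns the attention weights over frames and the weighted
    sum of frames (the attended context).
    """
    scores = features @ query       # (T,) one score per frame
    weights = softmax(scores)       # (T,) non-negative, sums to 1
    context = weights @ features    # (D,) weighted combination
    return weights, context

rng = np.random.default_rng(0)
feats = rng.normal(size=(30, 16))   # 30 frames, 16-dim features
q = rng.normal(size=16)
w, ctx = attend(feats, q)
```

The weights emphasize the frames most relevant to the query, which is one simple way an attention module can select a "region of interest" in the feature map while preserving sequence information.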


Summary

Introduction

Voice information plays an increasingly important role in our lives, and voice communication is becoming more and more frequent, such as using chat software to send voice messages, using voice to control mobile phone applications, making phone calls, recognizing the singers of songs [1], and identifying singer information, lyrics, and song style [2, 3]. The goal of speech separation is to separate mixed speech into the two original speech signals. Speech separation is a basic task with a wide range of applications, including mobile communication, speaker recognition, and song separation. It plays an increasingly important role in speech processing, and more and more devices need to carry out speech separation. Compared with traditional speech methods, deep neural network-based speech separation models offer many advantages. The main contribution of this paper is to apply the convolutional neural network to the speech separation task: using the multilayer nonlinear processing structure of the convolutional neural network to mine the structural information in the speech signal, automatically extracting abstract features, integrating the attention mechanism to reduce the loss of sequence information, and achieving monaural speech separation.

Relative Research
Evaluation
Proposed Methods
Simulation Experiments and Analysis
