Abstract

Speech spectrograms exhibit strong contextual dependencies along both the time and frequency dimensions. This paper proposes a novel composite model for speech separation that integrates a long short-term memory (LSTM) network and a convolutional neural network (CNN) to exploit temporal and spectral contextual speech information, respectively. The LSTM and CNN operate in parallel, which speeds up processing while each branch independently extracts a complementary set of speech features. A fully connected network then maps these features to the real and imaginary components of a ratio mask, enhancing the magnitude and phase of the corrupted speech simultaneously. In the CNN path, a carefully designed network with frequency-dilated one-dimensional (1D) convolutional layers expands the receptive field of the convolution kernels without increasing model complexity. This CNN also benefits from residual learning and skip connections, which facilitate training and accelerate convergence. Although the composite model combines different neural networks, the proposed separation system not only has low computational complexity but also significantly outperforms several other deep learning-based methods.
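Two ideas in the abstract lend themselves to a small numerical illustration: dilated convolutions widening the receptive field at no extra parameter cost, and a complex-valued ratio mask jointly correcting magnitude and phase. The sketch below is illustrative only; the function names, kernel size, and dilation schedule are assumptions for the example, not details taken from the paper.

```python
# (1) Receptive field of stacked dilated 1-D convolutions.
# With kernel size k and dilation d_i in layer i, each layer adds
# (k - 1) * d_i taps, so doubling dilations grow context
# exponentially while the parameter count stays fixed.
def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Four layers, kernel 3: dilated vs. undilated (illustrative values).
rf_dilated = receptive_field(3, [1, 2, 4, 8])  # 1 + 2*(1+2+4+8) = 31
rf_plain = receptive_field(3, [1, 1, 1, 1])    # 1 + 2*4 = 9

# (2) Complex ratio mask: the network predicts real and imaginary
# mask components (m_r, m_i); multiplying the noisy STFT bin Y by
# (m_r + j*m_i) rescales its magnitude AND rotates its phase.
def apply_crm(mask_real, mask_imag, noisy_bin):
    return complex(mask_real, mask_imag) * noisy_bin

enhanced = apply_crm(0.8, 0.1, complex(2.0, -1.0))
```

The gap between `rf_dilated` and `rf_plain` is the point of the frequency-dilated design: the same four layers cover a much wider band of frequency context. Likewise, because `apply_crm` multiplies by a complex number rather than a real gain, phase is corrected along with magnitude, which a real-valued mask cannot do.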
