Abstract

Speech spectrograms exhibit strong contextual dependencies along both the time and frequency dimensions. This paper proposes a novel composite model for speech separation that integrates a long short-term memory (LSTM) network and a convolutional neural network (CNN) to exploit temporal and spectral contextual speech information, respectively. The LSTM and CNN operate in parallel, which speeds up processing while each branch independently extracts a complementary set of speech features. A fully connected network then maps these features to the real and imaginary components of a ratio mask, enhancing the magnitude and phase of the corrupted speech simultaneously. In the CNN path, a carefully designed network with frequency-dilated one-dimensional (1D) convolutional layers expands the receptive field of the convolution kernels without increasing model complexity. This CNN also benefits from residual learning and skip connections, which facilitate training and accelerate convergence. Although the composite model combines different neural networks, the proposed separation system not only has low computational complexity but also significantly outperforms several other deep learning-based methods.
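Two ideas in the abstract lend themselves to a small numerical illustration: dilated convolutions widening the receptive field at no extra parameter cost, and a complex-valued ratio mask jointly correcting magnitude and phase. The sketch below is illustrative only; the function names, kernel size, and dilation schedule are assumptions for the example, not details taken from the paper.

```python
# (1) Receptive field of stacked dilated 1-D convolutions.
# With kernel size k and dilation d_i in layer i, each layer adds
# (k - 1) * d_i taps, so doubling dilations grow context
# exponentially while the parameter count stays fixed.
def receptive_field(kernel_size, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Four layers, kernel 3: dilated vs. undilated (illustrative values).
rf_dilated = receptive_field(3, [1, 2, 4, 8])  # 1 + 2*(1+2+4+8) = 31
rf_plain = receptive_field(3, [1, 1, 1, 1])    # 1 + 2*4 = 9

# (2) Complex ratio mask: the network predicts real and imaginary
# mask components (m_r, m_i); multiplying the noisy STFT bin Y by
# (m_r + j*m_i) rescales its magnitude AND rotates its phase.
def apply_crm(mask_real, mask_imag, noisy_bin):
    return complex(mask_real, mask_imag) * noisy_bin

enhanced = apply_crm(0.8, 0.1, complex(2.0, -1.0))
```

The gap between `rf_dilated` and `rf_plain` is the point of the frequency-dilated design: the same four layers cover a much wider band of frequency context. Likewise, because `apply_crm` multiplies by a complex number rather than a real gain, phase is corrected along with magnitude, which a real-valued mask cannot do.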
