Abstract

Deep-learning-based speech separation typically uses a supervised algorithm to learn a mapping from noisy features to separation targets. These targets, whether ideal masks or magnitude spectrograms, have prominent spectro-temporal structure. Nonnegative matrix factorization (NMF) is a well-known representation learning technique capable of capturing basic spectral structures. Combining deep learning and NMF into a unified framework is therefore an appealing strategy. However, previous methods typically apply deep neural networks (DNNs) and NMF to speech separation in a separate manner. In this paper, we propose a joint scheme that combines the strengths of both DNNs and NMF for speech separation. NMF learns basis spectra that are then integrated into a DNN to directly reconstruct the magnitude spectrograms of speech and noise. Instead of predicting activation coefficients inferred by NMF, which previous methods use as an intermediate target, the DNN in our system directly optimizes an actual separation objective, so that accumulated errors are alleviated. Moreover, we explore a discriminative training objective with sparsity constraints to further suppress noise and preserve more speech components. Systematic experiments show that the proposed models are competitive with previous methods.
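A minimal sketch of the kind of joint DNN-NMF scheme described above, written in PyTorch. All specifics here are illustrative assumptions rather than the paper's actual configuration: the network sizes, the fixed pre-learned bases, and the exact forms of the discriminative and sparsity terms are placeholders chosen only to show how a DNN can predict nonnegative activations and be trained end-to-end on the reconstructed spectrograms.

```python
# Hypothetical sketch of a joint DNN-NMF separation scheme (not the paper's exact model).
import torch
import torch.nn as nn

F_BINS, RANK = 257, 40  # spectrogram bins and NMF basis rank (assumed values)

# Basis spectra, e.g. pre-learned with NMF on clean speech / noise; held fixed here.
W_speech = torch.rand(F_BINS, RANK)
W_noise = torch.rand(F_BINS, RANK)

class JointDNN(nn.Module):
    """Maps noisy magnitude features to nonnegative NMF activations for speech and noise."""
    def __init__(self, f_bins=F_BINS, rank=RANK, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(f_bins, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * rank), nn.Softplus(),  # activations must be nonnegative
        )

    def forward(self, x):
        h = self.net(x)
        h_s, h_n = h[:, :RANK], h[:, RANK:]
        # Reconstruct magnitude spectrograms directly through the fixed bases,
        # so the training loss is defined on the actual separation targets.
        s_hat = h_s @ W_speech.T
        n_hat = h_n @ W_noise.T
        return s_hat, n_hat, h_s, h_n

def loss_fn(s_hat, n_hat, h_s, h_n, s_ref, n_ref, gamma=0.05, lam=1e-3):
    """Illustrative discriminative objective with a sparsity penalty on the activations."""
    mse = nn.functional.mse_loss
    disc = mse(s_hat, s_ref) + mse(n_hat, n_ref) \
         - gamma * (mse(s_hat, n_ref) + mse(n_hat, s_ref))  # penalize cross-source leakage
    sparsity = lam * (h_s.abs().mean() + h_n.abs().mean())
    return disc + sparsity

# Toy usage with random tensors standing in for spectrogram frames.
model = JointDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
noisy = torch.rand(8, F_BINS)
clean_speech, noise = torch.rand(8, F_BINS), torch.rand(8, F_BINS)
s_hat, n_hat, h_s, h_n = model(noisy)
loss = loss_fn(s_hat, n_hat, h_s, h_n, clean_speech, noise)
loss.backward(); opt.step()
```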
