Abstract

In this paper, we present a new supervised speech enhancement approach based on a cooperative structure of deep autoencoders (DAEs) as generative models and a deep neural network (DNN). The DAEs serve as a nonlinear alternative to nonnegative matrix factorization (NMF) for extracting harmonic structures and encoded features of the noise, clean, and noisy signals, while the DNN is deployed as a nonlinear mapper. We introduce a deep network that imitates NMF in a nonlinear manner to overcome the limitations of a simple linear model, such as performance degradation in non-stationary environments. In contrast to methods that combine NMF with a DNN, we perform the decomposition, enhancement, and reconstruction entirely within a nonlinear framework via a cooperative structure of encoder, DNN, and decoders, and we optimize them jointly. We also propose a supervised hierarchical multi-target training approach, performed in two steps, in which the DNN not only predicts the low-level encoded features as primary targets but also predicts the high-level actual spectral signals as secondary targets; the first step acts as pretraining for the second, which improves the learning strategy. Moreover, to obtain a more discriminative model for noise reduction, a DNN-based noise classification and fusion (NCF) strategy is also proposed. Experiments on the TIMIT dataset show that the proposed methods outperform previous approaches, achieving an average perceptual evaluation of speech quality (PESQ) improvement of up to about 0.3.
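The abstract only outlines the cooperative encoder-DNN-decoder structure and the multi-target training idea, so the following is a minimal PyTorch-style sketch of how such a pipeline could be wired together. All layer sizes, variable names (FREQ_BINS, CODE_DIM, mapper, etc.), and the loss weighting alpha are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not from the paper).
FREQ_BINS = 257   # e.g., magnitude bins of a 512-point STFT
CODE_DIM = 64     # dimension of the DAE encoded features

def mlp(sizes):
    """Stack of Linear+ReLU layers ending in a plain Linear layer."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# Encoder for noisy spectra; decoders reconstruct clean speech and noise.
noisy_encoder = mlp([FREQ_BINS, 512, CODE_DIM])
clean_decoder = mlp([CODE_DIM, 512, FREQ_BINS])
noise_decoder = mlp([CODE_DIM, 512, FREQ_BINS])

# DNN mapper: noisy codes -> concatenated clean and noise codes.
mapper = mlp([CODE_DIM, 512, 512, 2 * CODE_DIM])

def enhance(noisy_mag):
    """noisy_mag: (batch, FREQ_BINS) magnitude spectra -> clean, noise estimates."""
    code = mapper(noisy_encoder(noisy_mag))
    clean_code, noise_code = code.split(CODE_DIM, dim=-1)
    return clean_decoder(clean_code), noise_decoder(noise_code)

def multi_target_loss(noisy_mag, clean_mag, noise_mag,
                      clean_code_tgt, noise_code_tgt, alpha=0.5):
    """Hierarchical multi-target objective: low-level encoded features as
    primary targets, high-level spectra as secondary targets (alpha is an
    assumed weighting between the two)."""
    code = mapper(noisy_encoder(noisy_mag))
    clean_code, noise_code = code.split(CODE_DIM, dim=-1)
    clean_hat = clean_decoder(clean_code)
    noise_hat = noise_decoder(noise_code)
    mse = nn.functional.mse_loss
    low = mse(clean_code, clean_code_tgt) + mse(noise_code, noise_code_tgt)
    high = mse(clean_hat, clean_mag) + mse(noise_hat, noise_mag)
    return alpha * low + (1 - alpha) * high
```

In this sketch the code targets would come from DAEs pretrained separately on clean and noise spectra, mirroring the two-step training described above, after which encoder, mapper, and decoders are optimized jointly.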
