Recent deep neural network (DNN) based single-channel speech enhancement methods have achieved remarkable results in the time-frequency (TF) magnitude domain. To further improve the quality and intelligibility of enhanced speech, phase enhancement has also attracted increasing attention. In this paper, we propose a novel dilated convolutional network (DCN) model that simultaneously enhances the magnitude and phase of noisy speech. Unlike direct complex spectral mapping methods, we take the complex spectrum of the signal as the main target and the ideal ratio mask (IRM) as the auxiliary target in a multi-target learning framework, exploiting their complementary advantages. First, a feature extraction module is introduced to fuse local and long-term features. The two targets are learned separately but share this common feature extraction module, which helps extract more general and suitable features. During joint learning, the intermediate IRM estimate in the auxiliary path serves as an attention gating factor that helps distinguish speech from non-speech components of the complex-valued signals in the main path. To leverage finer-grained long-term contextual information, we introduce a multi-scale dilated convolution approach for feature encoding. Moreover, the proposed model is causal and can therefore meet the low-latency requirements of real-time speech products. Experimental results show that, compared with other advanced systems, the proposed model not only achieves better speech denoising performance and phase estimation accuracy but also generalizes better under speaker, noise, and channel mismatch conditions.
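To make the two architectural ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) a causal multi-scale dilated convolution block that fuses local and long-term context, and (b) a two-path output head in which the auxiliary IRM estimate gates the main complex-spectrum path. The abstract does not specify layer widths, kernel sizes, dilation rates, or module names, so all of those (e.g., `MultiScaleDilatedBlock`, `GatedMultiTargetHead`, the dilation set `(1, 2, 4, 8)`) are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleDilatedBlock(nn.Module):
    """Parallel causal 1-D convolutions with different dilation rates,
    fused by a 1x1 convolution to mix local and long-term context.
    Kernel size and dilation rates are assumed, not from the paper."""

    def __init__(self, channels, dilations=(1, 2, 4, 8), kernel_size=3):
        super().__init__()
        # Left-only padding per branch keeps the convolution causal
        # (no dependence on future frames), matching the abstract's
        # claim that the system is causal / low-latency.
        self.pads = [(kernel_size - 1) * d for d in dilations]
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d)
            for d in dilations
        )
        self.fuse = nn.Conv1d(channels * len(dilations), channels, 1)

    def forward(self, x):  # x: (batch, channels, frames)
        outs = [conv(F.pad(x, (pad, 0)))  # pad only on the past side
                for pad, conv in zip(self.pads, self.branches)]
        return torch.relu(self.fuse(torch.cat(outs, dim=1)))


class GatedMultiTargetHead(nn.Module):
    """Two output paths over shared features: an auxiliary IRM estimate
    whose sigmoid output gates the main complex-spectrum estimate."""

    def __init__(self, channels, freq_bins):
        super().__init__()
        self.irm_head = nn.Conv1d(channels, freq_bins, 1)          # auxiliary target
        self.complex_head = nn.Conv1d(channels, 2 * freq_bins, 1)  # real + imaginary

    def forward(self, feats):  # feats: (batch, channels, frames)
        irm = torch.sigmoid(self.irm_head(feats))   # mask values in [0, 1]
        real, imag = self.complex_head(feats).chunk(2, dim=1)
        # The intermediate IRM acts as an attention gating factor that
        # attenuates non-speech components in the complex-valued main path.
        return irm, real * irm, imag * irm


# Usage sketch: 64 shared feature channels, 161 frequency bins.
feats = MultiScaleDilatedBlock(64)(torch.randn(1, 64, 100))
irm, real, imag = GatedMultiTargetHead(64, 161)(feats)
```

Gating the real and imaginary parts with the sigmoid IRM estimate is one plausible reading of "attention gating factors"; the paper may combine the two paths differently.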