Abstract

Speech enhancement is an essential task for improving the quality and intelligibility of speech signals corrupted by noise. Current deep neural network-based speech enhancement methods have achieved remarkable results. However, most of them operate only in the time domain or the time–frequency domain, and thus do not fully exploit the complementary advantages of the two domains. In this paper, we propose a framework with joint waveform and magnitude processing for single-channel speech enhancement, which exploits the complementary strengths of time-domain and time–frequency features. Specifically, the proposed network adopts a triple-stage training strategy. In the first two stages, two sub-networks take the waveform and the magnitude spectrum as input features, respectively, and each generates a pre-enhanced speech signal. In the third stage, a fusion sub-network combines the two pre-enhanced signals, further improving speech quality and intelligibility. All three sub-networks adopt an encoder-decoder architecture with skip connections, and an additional temporal convolutional network is inserted between the encoder and the decoder. To improve information flow through the network, we introduce a gating mechanism into the temporal convolutional network, which we refer to as the gated temporal convolutional network. In addition, the recently popular group communication strategy is introduced into the network, which significantly reduces the number of trainable parameters while achieving on-par or better enhancement performance. Experimental results demonstrate that the proposed method consistently outperforms advanced baselines in terms of objective speech quality and intelligibility metrics. Moreover, the proposed model exhibits outstanding cross-corpus and cross-language generalization.
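For illustration, the sketch below shows one plausible realization of the gated temporal convolutional block mentioned above. It is a minimal PyTorch example assuming WaveNet-style sigmoid gating with a residual connection; the class name GatedTCNBlock, the channel count, the dilation schedule, and the normalization choice are assumptions for the sketch, not the paper's exact configuration.

    import torch
    import torch.nn as nn


    class GatedTCNBlock(nn.Module):
        """Hypothetical gated temporal-convolutional block (illustrative only)."""

        def __init__(self, channels: int = 64, kernel_size: int = 3, dilation: int = 1):
            super().__init__()
            pad = (kernel_size - 1) * dilation
            # Two parallel dilated 1-D convolutions: a content path and a gate path.
            self.content = nn.Conv1d(channels, channels, kernel_size,
                                     dilation=dilation, padding=pad)
            self.gate = nn.Conv1d(channels, channels, kernel_size,
                                  dilation=dilation, padding=pad)
            self.norm = nn.GroupNorm(1, channels)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, frames)
            frames = x.shape[-1]
            c = torch.tanh(self.content(x))[..., :frames]   # content path, trimmed to causal length
            g = torch.sigmoid(self.gate(x))[..., :frames]   # sigmoid gate modulating the content
            return self.norm(x + c * g)                      # gated residual output


    if __name__ == "__main__":
        # Stack blocks with exponentially increasing dilations, as is typical for TCNs.
        tcn = nn.Sequential(*[GatedTCNBlock(dilation=2 ** d) for d in range(4)])
        dummy = torch.randn(1, 64, 200)                      # stand-in for an encoder output
        print(tcn(dummy).shape)                              # -> torch.Size([1, 64, 200])

In this sketch the gate rescales the content path element-wise before the residual sum, which is one common way a gating mechanism can regulate information flow between the encoder and decoder stages.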
